Project Title : BANK CUSTOMER CHURN ANALYSIS¶

In [ ]:
 

MY PROFILE¶

In [1]:
from PIL import Image
In [2]:
img=Image.open('Customer-Churn-1.png')
In [3]:
img
Out[3]:
In [ ]:
 

INTRODUCTION TO LOGISTIC REGRESSION¶

Logistic regresion makes it possible to predict a variable that is dichotomous ,meaning a prediction that requires only two variable.eg a yes or no answer , true or false , big or small, man or woman etc.¶

The varable to be resolve is called , predicted , target or criterion ,while the varables you need to predict the ###results are,called the independent variables, or the predictors.¶

Examples of different machine learning logistic regression algorithms are ,¶

Naive Bayes¶

Logistic Regression¶

K-Nearest Neighbours¶

Support Vector Machine¶

Decision Tree¶

Bagging Decision Tree (Ensemble Learning I)¶

Boosted Decision Tree (Ensemble Learning II)¶

Random Forest (Ensemble Learning III)¶

Voting Classification (Ensemble Learning IV)¶

Neural Network (Deep Learning) etc and we will be using some of them in this project¶

AIMS AND OBJECTIVES OF CARRYING OUT CUSTOMER CHURN ANALYSIS:¶

Customer churn is the rate at which customers unsbscribe or leave a company .It can be for a virety of reason like ,¶

unpleasant customer service , increase in cost beared by customers , un updated technologies and many other reasons.¶

Customers are inevitably the most important part of any organisation, the lost of a customer simply means reduction in¶

revenue , which can lead to slow business growth in a company or business and in most cases , constant increase in customer¶

churn, may lead to outright closure of the company or business.¶

Hence customer churn analysis and building the right model for a business to avert customer churn ,is a critical and very¶

important aspect of every company ,knowing the right reason a company customer churn and very early can help the company ,¶

adress the issue effectively and stop any future customers churn timely.¶

customer churn also help companies to be able to segment customers in groups reduce marketing expences , by understanding how to target the right potential customers through targeted marketing etc , it also gives companies ,opportunity to understand their customers needs better.¶

PROBLEM STATEMENT:¶

An old bank have discovered that , every month, the number of V.I.P , cutomers unsubscribing and leaving their bank monthly , despite the bank reducing intrest rate to encourage customers and also introducing a lot of new automatic bank transaction packages at no extra cost ,to encourage customers to stay with the bank ,have no effect , as more customers are leaving the bank, than the number of new customers subscribing to the bank .¶

Further preliminary analsis shows the following¶

1, The cost of getting new customers to subscribe to the bank is 25% more expensive than retaining old customers,¶

2, Most new customers that subscribe, replacing the V.I.P. customers are average customers with low income ,hence low savings balance in their account. which is generally reducing the total income generated by the bank monthly.¶

Through another analysis the bank also determined that ,if this customer churn trend continue at this rate monthly , the bank may become bankrupt in 24 months peeriod and needed a fast and permanent solution to these higlighted problems.¶

The bank have contacted me to determine the following and advice them accordignly:¶

a, Why are customers churning on the increase monthly ,especially the V.I.P. customers¶

b,Which insights can be derived from my analysis to highlight the main problems¶

c, Advice the bank managers or management on how to solve this problems , based on insights derived from the customer transaction data provided to me , by the bank.¶

d, The bank also want me to build a solution model ,that can be used by the bank to prevent this issue in the future.¶

BENEFICARIES OF THIS ANALYSIS¶

1, SALES DEPARTMENT¶

2, CUSTOMER SERVICE¶

3,MARKETING DEPARTMENT¶

4,PRODUCTION : GOODS/SERVICES¶

5, LOGISTICS¶

6, COMPANY MANAGEMENT¶

APPROACH :¶

1, UNDERSTANDING THE PROBLEM STATEMENT¶

2, GATHERING OF NECESSARY DATA FOR ANALYSIS AND MODEL DEPLOYMENT¶

3, STUDYING AND UNDERSTANDING THE DATA¶

4, CLEANING / WRANGLING THE DATA¶

5, QUERRYING AND PERFORMING ANALYSIS ON THE DATA WITH VISUALIZATIONS¶

6, FUTURE ENGINEERING / DATA PREPROCESSING¶

7, SPLITING THE DATA¶

8,APPLYING RELEVANT LOGISTIC REGRESSION MACHINE LEARNING ALGORITHMS¶

9, DETERMINING THE RESULTS BEST ACCURACY¶

10 BUILDING AND DEPLOYING THE MODEL FOR FUTURE PRODUCTION¶

1, IMPORTING THE PYTHON LYBRARIES AND DATASET :¶

In [4]:
import pandas as pd
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
import itertools
import missingno as msno
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
pd.options.display.float_format = '{:.2f}'.format
import warnings
warnings.filterwarnings('ignore')        
In [5]:
churn1=pd.read_csv('C:/Users/HP/Downloads/churn_modelling1.csv')
                  
In [6]:
churn1.head()
Out[6]:
RowNumber CustomerId Surname CreditScore Geography Gender Age Tenure Balance NumOfProducts
0 1.00 15634602.00 Hargrave 619.00 France Female 42.00 2.00 0.00 1.00
1 2.00 NaN Hill 608.00 Spain Female 41.00 1.00 83807.86 1.00
2 3.00 15619304.00 Onio 502.00 France Female 42.00 8.00 159660.80 3.00
3 4.00 NaN Boni 699.00 France Female 39.00 1.00 0.00 2.00
4 5.00 NaN NaN NaN NaN NaN NaN NaN NaN NaN
In [7]:
churn2=pd.read_csv('C:/Users/HP/Downloads/churn_modelling2.csv') 
In [8]:
churn2.head()
Out[8]:
Row_Number Has_CrCard ISACTIVEMEMBER Estimated Salary Exited
0 1.00 1.00 1.00 101348.88 1.00
1 2.00 0.00 1.00 112542.58 0.00
2 3.00 1.00 0.00 113931.57 1.00
3 4.00 0.00 0.00 93826.63 0.00
4 NaN NaN NaN NaN NaN

COLUMN NAME CONSISTENT¶

In [514]:
# USEING RENAME FUNCTION TO MAKE ALL COLUMNS NAME CONSISTENT
churn2.columns
Out[514]:
Index(['surname', 'creditscore', 'geography', 'gender', 'age', 'tenure',
       'balance', 'numofproducts', 'isactivemember', 'estimatedsalary',
       'exited'],
      dtype='object')
In [10]:
churn2.rename(columns={'Row_Number':'RowNumber','Has_CrCard':'HasCrCard'},inplace = True)
In [11]:
churn2.rename(columns={'    Estimated  Salary':'  estimated salary'},inplace=True)
In [12]:
churn2.columns
Out[12]:
Index(['RowNumber', 'HasCrCard', 'ISACTIVEMEMBER', '  estimated salary',
       'Exited'],
      dtype='object')
In [13]:
churn2.head()
Out[13]:
RowNumber HasCrCard ISACTIVEMEMBER estimated salary Exited
0 1.00 1.00 1.00 101348.88 1.00
1 2.00 0.00 1.00 112542.58 0.00
2 3.00 1.00 0.00 113931.57 1.00
3 4.00 0.00 0.00 93826.63 0.00
4 NaN NaN NaN NaN NaN

USE STR REPLACE TO REPLACE SPACES BEFORE OR AFTER THE COLUMN NAME¶

In [516]:
churn2.columns=churn2.columns.str.replace(' ','')

WE CAN REPLACE COLUMN HEAD WITH THIS METHOD¶

In [15]:
churn2.columns
Out[15]:
Index(['RowNumber', 'HasCrCard', 'ISACTIVEMEMBER', 'estimatedsalary',
       'Exited'],
      dtype='object')
In [16]:
churn2_col=['rownumber', 'hascrcard', 'isactivemember', 'estimatedsalary','exited']
In [17]:
churn2.columns=churn2_col
In [18]:
churn2.head(2)
Out[18]:
rownumber hascrcard isactivemember estimatedsalary exited
0 1.00 1.00 1.00 101348.88 1.00
1 2.00 0.00 1.00 112542.58 0.00
In [19]:
churn1_cols=['rownumber','customerid','surname','creditscore','geography','gender','age','tenure','balance','numofproducts']
In [20]:
churn1.columns=churn1_cols
In [21]:
churn1.columns
Out[21]:
Index(['rownumber', 'customerid', 'surname', 'creditscore', 'geography',
       'gender', 'age', 'tenure', 'balance', 'numofproducts'],
      dtype='object')
In [22]:
churn1.head()
Out[22]:
rownumber customerid surname creditscore geography gender age tenure balance numofproducts
0 1.00 15634602.00 Hargrave 619.00 France Female 42.00 2.00 0.00 1.00
1 2.00 NaN Hill 608.00 Spain Female 41.00 1.00 83807.86 1.00
2 3.00 15619304.00 Onio 502.00 France Female 42.00 8.00 159660.80 3.00
3 4.00 NaN Boni 699.00 France Female 39.00 1.00 0.00 2.00
4 5.00 NaN NaN NaN NaN NaN NaN NaN NaN NaN
In [23]:
churn1=pd.read_csv('C:/Users/HP/Downloads/churn_modelling1.csv',names=churn1_cols,header=0)
In [24]:
churn1.head()
Out[24]:
rownumber customerid surname creditscore geography gender age tenure balance numofproducts
0 1.00 15634602.00 Hargrave 619.00 France Female 42.00 2.00 0.00 1.00
1 2.00 NaN Hill 608.00 Spain Female 41.00 1.00 83807.86 1.00
2 3.00 15619304.00 Onio 502.00 France Female 42.00 8.00 159660.80 3.00
3 4.00 NaN Boni 699.00 France Female 39.00 1.00 0.00 2.00
4 5.00 NaN NaN NaN NaN NaN NaN NaN NaN NaN

ADD THE TWO DATA SET INTO A NEW VARIABLE NAME CHURN¶

MERGE FUNCTION¶

In [25]:
churn=pd.merge(churn1,churn2,on='rownumber',how='left')

START TO STUDY THE DATA FOR BETTER UNDERSTANDING¶

In [ ]:
 
In [26]:
#TO SHOW THE 5 FIRST ROWS OF THE DATASET
churn.head()
Out[26]:
rownumber customerid surname creditscore geography gender age tenure balance numofproducts hascrcard isactivemember estimatedsalary exited
0 1.00 15634602.00 Hargrave 619.00 France Female 42.00 2.00 0.00 1.00 1.00 1.00 101348.88 1.00
1 2.00 NaN Hill 608.00 Spain Female 41.00 1.00 83807.86 1.00 0.00 1.00 112542.58 0.00
2 3.00 15619304.00 Onio 502.00 France Female 42.00 8.00 159660.80 3.00 1.00 0.00 113931.57 1.00
3 4.00 NaN Boni 699.00 France Female 39.00 1.00 0.00 2.00 0.00 0.00 93826.63 0.00
4 5.00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
In [27]:
churn
Out[27]:
rownumber customerid surname creditscore geography gender age tenure balance numofproducts hascrcard isactivemember estimatedsalary exited
0 1.00 15634602.00 Hargrave 619.00 France Female 42.00 2.00 0.00 1.00 1.00 1.00 101348.88 1.00
1 2.00 NaN Hill 608.00 Spain Female 41.00 1.00 83807.86 1.00 0.00 1.00 112542.58 0.00
2 3.00 15619304.00 Onio 502.00 France Female 42.00 8.00 159660.80 3.00 1.00 0.00 113931.57 1.00
3 4.00 NaN Boni 699.00 France Female 39.00 1.00 0.00 2.00 0.00 0.00 93826.63 0.00
4 5.00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
10027 NaN NaN Johnstone 516.00 France Male 35.00 10.00 57369.61 1.00 NaN NaN NaN NaN
10028 NaN NaN Liu 709.00 France Female 36.00 7.00 0.00 1.00 NaN NaN NaN NaN
10029 NaN NaN Liu 709.00 France Female 36.00 7.00 0.00 1.00 NaN NaN NaN NaN
10030 9999.00 NaN Sabbatini 772.00 Germany Male 42.00 3.00 75075.31 2.00 1.00 0.00 92888.52 1.00
10031 10000.00 15628319.00 Walker 792.00 France Female 28.00 4.00 130142.79 1.00 1.00 0.00 38190.78 0.00

10032 rows × 14 columns

In [28]:
# TO SHOW THE LAST 5 ROWS OF THE DATASET
churn.tail()
Out[28]:
rownumber customerid surname creditscore geography gender age tenure balance numofproducts hascrcard isactivemember estimatedsalary exited
10027 NaN NaN Johnstone 516.00 France Male 35.00 10.00 57369.61 1.00 NaN NaN NaN NaN
10028 NaN NaN Liu 709.00 France Female 36.00 7.00 0.00 1.00 NaN NaN NaN NaN
10029 NaN NaN Liu 709.00 France Female 36.00 7.00 0.00 1.00 NaN NaN NaN NaN
10030 9999.00 NaN Sabbatini 772.00 Germany Male 42.00 3.00 75075.31 2.00 1.00 0.00 92888.52 1.00
10031 10000.00 15628319.00 Walker 792.00 France Female 28.00 4.00 130142.79 1.00 1.00 0.00 38190.78 0.00
In [29]:
churn.index
Out[29]:
Int64Index([    0,     1,     2,     3,     4,     5,     6,     7,     8,
                9,
            ...
            10022, 10023, 10024, 10025, 10026, 10027, 10028, 10029, 10030,
            10031],
           dtype='int64', length=10032)
In [30]:
churn.sample
Out[30]:
<bound method NDFrame.sample of        rownumber  customerid    surname  creditscore geography  gender   age  \
0           1.00 15634602.00   Hargrave       619.00    France  Female 42.00   
1           2.00         NaN       Hill       608.00     Spain  Female 41.00   
2           3.00 15619304.00       Onio       502.00    France  Female 42.00   
3           4.00         NaN       Boni       699.00    France  Female 39.00   
4           5.00         NaN        NaN          NaN       NaN     NaN   NaN   
...          ...         ...        ...          ...       ...     ...   ...   
10027        NaN         NaN  Johnstone       516.00    France    Male 35.00   
10028        NaN         NaN        Liu       709.00    France  Female 36.00   
10029        NaN         NaN        Liu       709.00    France  Female 36.00   
10030    9999.00         NaN  Sabbatini       772.00   Germany    Male 42.00   
10031   10000.00 15628319.00     Walker       792.00    France  Female 28.00   

       tenure   balance  numofproducts  hascrcard  isactivemember  \
0        2.00      0.00           1.00       1.00            1.00   
1        1.00  83807.86           1.00       0.00            1.00   
2        8.00 159660.80           3.00       1.00            0.00   
3        1.00      0.00           2.00       0.00            0.00   
4         NaN       NaN            NaN        NaN             NaN   
...       ...       ...            ...        ...             ...   
10027   10.00  57369.61           1.00        NaN             NaN   
10028    7.00      0.00           1.00        NaN             NaN   
10029    7.00      0.00           1.00        NaN             NaN   
10030    3.00  75075.31           2.00       1.00            0.00   
10031    4.00 130142.79           1.00       1.00            0.00   

       estimatedsalary  exited  
0            101348.88    1.00  
1            112542.58    0.00  
2            113931.57    1.00  
3             93826.63    0.00  
4                  NaN     NaN  
...                ...     ...  
10027              NaN     NaN  
10028              NaN     NaN  
10029              NaN     NaN  
10030         92888.52    1.00  
10031         38190.78    0.00  

[10032 rows x 14 columns]>
In [31]:
churn.sample(frac=0.5)
Out[31]:
rownumber customerid surname creditscore geography gender age tenure balance numofproducts hascrcard isactivemember estimatedsalary exited
2793 2787.00 15585100.00 Rioux 511.00 Germany Female 40.00 9.00 124401.60 1.00 1.00 0.00 198814.24 1.00
752 746.00 15640059.00 Smith 606.00 France Male 40.00 5.00 0.00 2.00 1.00 1.00 70899.27 0.00
3262 3256.00 15671387.00 Fetherstonhaugh 507.00 France Female 29.00 4.00 89349.47 2.00 0.00 0.00 180626.68 0.00
6666 6650.00 15635277.00 Coates 605.00 Spain Male 47.00 7.00 142643.54 1.00 1.00 0.00 189310.27 0.00
4668 4662.00 15670416.00 Ferri 780.00 France Female 43.00 0.00 0.00 1.00 0.00 1.00 15705.27 0.00
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7524 7508.00 15642001.00 Lorenzen 576.00 Germany Male 44.00 9.00 119530.52 1.00 1.00 0.00 119056.68 1.00
8007 7991.00 15643635.00 Robertson 664.00 Spain Male 32.00 5.00 133705.74 1.00 0.00 0.00 134455.84 0.00
9916 9900.00 15811594.00 Gordon 660.00 Spain Female 28.00 3.00 128929.88 1.00 1.00 1.00 198069.71 0.00
3519 3513.00 NaN Boylan 806.00 Spain Male 18.00 3.00 0.00 2.00 1.00 1.00 86994.54 0.00
7472 7456.00 15748499.00 Johnson 550.00 Germany Male 33.00 4.00 118400.91 1.00 0.00 1.00 13999.64 1.00

5016 rows × 14 columns

In [33]:
churn.sample(7)
Out[33]:
rownumber customerid surname creditscore geography gender age tenure balance numofproducts hascrcard isactivemember estimatedsalary exited
1970 1964.00 15771139.00 Douglas 578.00 Germany Male 34.00 8.00 147487.23 2.00 1.00 0.00 66680.77 0.00
3998 3992.00 15601659.00 Fiorentino 496.00 Germany Female 59.00 7.00 91680.10 2.00 1.00 0.00 163141.18 1.00
4513 4507.00 15635177.00 Williamson 597.00 Spain Female 66.00 3.00 0.00 1.00 1.00 1.00 70532.53 0.00
8960 8944.00 15727350.00 Pai 516.00 France Female 37.00 8.00 113143.12 1.00 0.00 0.00 3363.36 0.00
9790 9774.00 15698462.00 Chiu 532.00 France Male 36.00 4.00 0.00 2.00 1.00 1.00 132798.78 0.00
3666 3660.00 15664668.00 Zarate 534.00 France Female 42.00 9.00 144801.97 1.00 0.00 1.00 12483.39 1.00
1860 1854.00 15633574.00 Montes 730.00 France Female 41.00 4.00 167545.32 1.00 1.00 0.00 128246.81 0.00
In [34]:
churn.sample(7,random_state=0)
Out[34]:
rownumber customerid surname creditscore geography gender age tenure balance numofproducts hascrcard isactivemember estimatedsalary exited
7109 7093.00 15723884.00 Nekrasova 758.00 Spain Male 40.00 3.00 0.00 2.00 0.00 0.00 96097.65 0.00
3847 3841.00 15662884.00 Naylor 739.00 Germany Male 58.00 1.00 110597.76 1.00 0.00 1.00 160122.66 1.00
5957 5941.00 15791958.00 Mazzi 849.00 France Female 41.00 6.00 0.00 2.00 1.00 1.00 169203.51 1.00
6787 6771.00 15815295.00 John 662.00 France Female 38.00 2.00 96479.81 1.00 1.00 0.00 120259.41 0.00
2849 2843.00 15785782.00 Ugonna 513.00 Spain Male 48.00 2.00 0.00 1.00 1.00 1.00 114709.13 1.00
3523 3517.00 15787151.00 Liao 638.00 France Female 34.00 7.00 0.00 2.00 1.00 1.00 198969.78 0.00
6287 6271.00 15702806.00 Martin 696.00 Spain Male 24.00 9.00 0.00 1.00 0.00 0.00 10883.52 0.00
In [35]:
churn
Out[35]:
rownumber customerid surname creditscore geography gender age tenure balance numofproducts hascrcard isactivemember estimatedsalary exited
0 1.00 15634602.00 Hargrave 619.00 France Female 42.00 2.00 0.00 1.00 1.00 1.00 101348.88 1.00
1 2.00 NaN Hill 608.00 Spain Female 41.00 1.00 83807.86 1.00 0.00 1.00 112542.58 0.00
2 3.00 15619304.00 Onio 502.00 France Female 42.00 8.00 159660.80 3.00 1.00 0.00 113931.57 1.00
3 4.00 NaN Boni 699.00 France Female 39.00 1.00 0.00 2.00 0.00 0.00 93826.63 0.00
4 5.00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
10027 NaN NaN Johnstone 516.00 France Male 35.00 10.00 57369.61 1.00 NaN NaN NaN NaN
10028 NaN NaN Liu 709.00 France Female 36.00 7.00 0.00 1.00 NaN NaN NaN NaN
10029 NaN NaN Liu 709.00 France Female 36.00 7.00 0.00 1.00 NaN NaN NaN NaN
10030 9999.00 NaN Sabbatini 772.00 Germany Male 42.00 3.00 75075.31 2.00 1.00 0.00 92888.52 1.00
10031 10000.00 15628319.00 Walker 792.00 France Female 28.00 4.00 130142.79 1.00 1.00 0.00 38190.78 0.00

10032 rows × 14 columns

In [518]:
#LETS VISUALIZE THE DISTRIBUTION OF ANY COLUMN JUST TO SEE HOW THE COLUMNS ARE SPREAD
churn['creditscore'].sample(frac=0.001).plot.bar()
Out[518]:
<AxesSubplot:>
In [519]:
churn.columns
Out[519]:
Index(['creditscore', 'age', 'tenure', 'balance', 'numofproducts',
       'isactivemember', 'estimatedsalary', 'exited', 'geography1',
       'geography_Germany', 'geography_Spain', 'gender_Male'],
      dtype='object')
In [520]:
list(churn.columns)
Out[520]:
['creditscore',
 'age',
 'tenure',
 'balance',
 'numofproducts',
 'isactivemember',
 'estimatedsalary',
 'exited',
 'geography1',
 'geography_Germany',
 'geography_Spain',
 'gender_Male']
In [521]:
churn.size
Out[521]:
119676
In [522]:
churn.shape
Out[522]:
(9973, 12)
In [523]:
print('number of columns',churn.shape[1])
number of columns 12
In [524]:
print('number of rows',churn.shape[0])
number of rows 9973
In [43]:
churn.dtypes
Out[43]:
rownumber          float64
customerid         float64
surname             object
creditscore        float64
geography           object
gender              object
age                float64
tenure             float64
balance            float64
numofproducts      float64
hascrcard          float64
isactivemember     float64
estimatedsalary    float64
exited             float64
dtype: object

lets seperate objects from numerical variables¶

In [525]:
churn_catg=churn.select_dtypes('object')
In [45]:
churn['geography'].value_counts().plot.bar()
Out[45]:
<AxesSubplot:>
In [46]:
churn_cont=churn.select_dtypes('number')
In [47]:
churn_catg
Out[47]:
surname geography gender
0 Hargrave France Female
1 Hill Spain Female
2 Onio France Female
3 Boni France Female
4 NaN NaN NaN
... ... ... ...
10027 Johnstone France Male
10028 Liu France Female
10029 Liu France Female
10030 Sabbatini Germany Male
10031 Walker France Female

10032 rows × 3 columns

In [48]:
churn_cont
Out[48]:
rownumber customerid creditscore age tenure balance numofproducts hascrcard isactivemember estimatedsalary exited
0 1.00 15634602.00 619.00 42.00 2.00 0.00 1.00 1.00 1.00 101348.88 1.00
1 2.00 NaN 608.00 41.00 1.00 83807.86 1.00 0.00 1.00 112542.58 0.00
2 3.00 15619304.00 502.00 42.00 8.00 159660.80 3.00 1.00 0.00 113931.57 1.00
3 4.00 NaN 699.00 39.00 1.00 0.00 2.00 0.00 0.00 93826.63 0.00
4 5.00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ...
10027 NaN NaN 516.00 35.00 10.00 57369.61 1.00 NaN NaN NaN NaN
10028 NaN NaN 709.00 36.00 7.00 0.00 1.00 NaN NaN NaN NaN
10029 NaN NaN 709.00 36.00 7.00 0.00 1.00 NaN NaN NaN NaN
10030 9999.00 NaN 772.00 42.00 3.00 75075.31 2.00 1.00 0.00 92888.52 1.00
10031 10000.00 15628319.00 792.00 28.00 4.00 130142.79 1.00 1.00 0.00 38190.78 0.00

10032 rows × 11 columns

In [526]:
#UNIQUE VALUES SHOWS THE TYPE OF VALUE IN EACH COLUMN
churn['estimatedsalary'].nunique()
Out[526]:
9970
In [527]:
churn['estimatedsalary'].unique
Out[527]:
<bound method Series.unique of 0       101348.88
1       112542.58
2       113931.57
3        93826.63
5       149756.71
           ...   
9997     99595.67
10006    96833.00
10025    96270.64
10030    92888.52
10031    38190.78
Name: estimatedsalary, Length: 9973, dtype: float64>
In [51]:
churn['estimatedsalary'].unique()[:20]
Out[51]:
array([101348.88, 112542.58, 113931.57,  93826.63,       nan, 149756.71,
       119346.88,  74940.5 ,  71725.73,  80181.12,  76390.01,  26260.98,
       190857.79,  65951.65,  64327.26,   5097.67,  14406.41, 158684.81,
        54724.03, 170886.17])

CLEANING THE DATA¶

check if the data vave duplicates¶

In [529]:
churn.duplicated()
Out[529]:
0        False
1        False
2        False
3        False
5        False
         ...  
9997     False
10006    False
10025    False
10030    False
10031    False
Length: 9973, dtype: bool
In [53]:
churn.duplicated().any()
Out[53]:
True
In [54]:
churn_dup=churn.duplicated().any()
In [55]:
print(churn_dup)
True
In [56]:
churn[churn.duplicated()]
Out[56]:
rownumber customerid surname creditscore geography gender age tenure balance numofproducts hascrcard isactivemember estimatedsalary exited
7 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
15 14.00 15691483.00 Chin 549.00 France Female 25.00 5.00 0.00 2.00 0.00 0.00 190857.79 0.00
16 14.00 15691483.00 Chin 549.00 France Female 25.00 5.00 0.00 2.00 0.00 0.00 190857.79 0.00
17 14.00 15691483.00 Chin 549.00 France Female 25.00 5.00 0.00 2.00 0.00 0.00 190857.79 0.00
19 15.00 15600882.00 Scott 635.00 Spain Female 35.00 7.00 0.00 2.00 1.00 1.00 65951.65 0.00
20 15.00 15600882.00 Scott 635.00 Spain Female 35.00 7.00 0.00 2.00 1.00 1.00 65951.65 0.00
21 15.00 15600882.00 Scott 635.00 Spain Female 35.00 7.00 0.00 2.00 1.00 1.00 65951.65 0.00
5278 NaN 15607230.00 Michel 588.00 Germany Male 33.00 9.00 150186.22 2.00 NaN NaN NaN NaN
5280 NaN 15567630.00 Bruce 721.00 Germany Male 40.00 6.00 100275.88 1.00 NaN NaN NaN NaN
5282 NaN 15587507.00 Feng 850.00 France Male 47.00 6.00 0.00 1.00 NaN NaN NaN NaN
5284 NaN 15733904.00 McDonald 529.00 France Male 32.00 9.00 147493.89 1.00 NaN NaN NaN NaN
5286 NaN NaN Watt 622.00 France Male 43.00 8.00 0.00 2.00 NaN NaN NaN NaN
5288 NaN NaN Goodwin 683.00 France Female 42.00 8.00 0.00 2.00 NaN NaN NaN NaN
5290 NaN NaN Arcuri 575.00 France Male 29.00 4.00 121823.40 2.00 NaN NaN NaN NaN
5292 NaN NaN Rapuluolisa 577.00 Spain Female 40.00 1.00 0.00 2.00 NaN NaN NaN NaN
5294 NaN NaN Kao 510.00 Germany Male 31.00 0.00 113688.63 1.00 NaN NaN NaN NaN
5296 NaN NaN Tsui 652.00 Spain Female 40.00 7.00 100471.34 1.00 NaN NaN NaN NaN
9999 NaN NaN Burbidge 498.00 Germany Male 42.00 3.00 152039.70 1.00 NaN NaN NaN NaN
10001 NaN NaN Griffin 655.00 Germany Female 46.00 7.00 137145.12 1.00 NaN NaN NaN NaN
10003 NaN NaN Cocci 613.00 France Male 40.00 4.00 0.00 1.00 NaN NaN NaN NaN
10005 NaN 15696175.00 Echezonachukwu 602.00 Germany Male 35.00 7.00 90602.42 2.00 NaN NaN NaN NaN
10008 NaN NaN Bartlett 673.00 Germany Male 47.00 1.00 183579.54 2.00 NaN NaN NaN NaN
10010 NaN NaN Mancini 606.00 Spain Male 30.00 8.00 180307.73 2.00 NaN NaN NaN NaN
10012 NaN NaN Pirozzi 775.00 France Male 30.00 4.00 0.00 2.00 NaN NaN NaN NaN
10014 NaN NaN McMillan 841.00 Spain Male 28.00 4.00 0.00 2.00 NaN NaN NaN NaN
10016 NaN 15798964.00 Nkemakonam 714.00 Germany Male 33.00 3.00 35016.60 1.00 NaN NaN NaN NaN
10018 NaN NaN Ajuluchukwu 597.00 France Female 53.00 4.00 88381.21 1.00 NaN NaN NaN NaN
10020 NaN NaN Chukwualuka 726.00 Spain Male 36.00 2.00 0.00 1.00 NaN NaN NaN NaN
10022 NaN NaN Rahman 644.00 France Male 28.00 7.00 155060.41 1.00 NaN NaN NaN NaN
10024 NaN NaN Wood 800.00 France Female 29.00 2.00 0.00 2.00 NaN NaN NaN NaN
10027 NaN NaN Johnstone 516.00 France Male 35.00 10.00 57369.61 1.00 NaN NaN NaN NaN
10029 NaN NaN Liu 709.00 France Female 36.00 7.00 0.00 1.00 NaN NaN NaN NaN
In [57]:
churn.drop_duplicates(inplace=True)
In [58]:
churn_dup=churn.duplicated().any()
In [59]:
print(churn_dup)
False
In [60]:
churn.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 0 to 10031
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   rownumber        9974 non-null   float64
 1   customerid       8515 non-null   float64
 2   surname          9998 non-null   object 
 3   creditscore      9998 non-null   float64
 4   geography        9998 non-null   object 
 5   gender           9998 non-null   object 
 6   age              9998 non-null   float64
 7   tenure           9998 non-null   float64
 8   balance          9998 non-null   float64
 9   numofproducts    9998 non-null   float64
 10  hascrcard        9973 non-null   float64
 11  isactivemember   9973 non-null   float64
 12  estimatedsalary  9973 non-null   float64
 13  exited           9973 non-null   float64
dtypes: float64(11), object(3)
memory usage: 1.1+ MB
In [61]:
churn.count()
Out[61]:
rownumber          9974
customerid         8515
surname            9998
creditscore        9998
geography          9998
gender             9998
age                9998
tenure             9998
balance            9998
numofproducts      9998
hascrcard          9973
isactivemember     9973
estimatedsalary    9973
exited             9973
dtype: int64

check for missing values¶

In [62]:
churn.isnull().sum()
Out[62]:
rownumber            26
customerid         1485
surname               2
creditscore           2
geography             2
gender                2
age                   2
tenure                2
balance               2
numofproducts         2
hascrcard            27
isactivemember       27
estimatedsalary      27
exited               27
dtype: int64
In [63]:
churn.isnull().sum().plot.bar()
Out[63]:
<AxesSubplot:>
In [64]:
churn.isnull().sum().sum()
Out[64]:
1635
In [65]:
churn_missing=churn[churn.isnull().any(1)]
In [66]:
churn_missing
Out[66]:
rownumber customerid surname creditscore geography gender age tenure balance numofproducts hascrcard isactivemember estimatedsalary exited
1 2.00 NaN Hill 608.00 Spain Female 41.00 1.00 83807.86 1.00 0.00 1.00 112542.58 0.00
3 4.00 NaN Boni 699.00 France Female 39.00 1.00 0.00 2.00 0.00 0.00 93826.63 0.00
4 5.00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 6.00 NaN Chu 645.00 Spain Male 44.00 8.00 113755.78 2.00 1.00 0.00 149756.71 1.00
6 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
10023 NaN NaN Wood 800.00 France Female 29.00 2.00 0.00 2.00 NaN NaN NaN NaN
10025 9996.00 NaN Obijiaku 771.00 France Male 39.00 5.00 0.00 2.00 1.00 0.00 96270.64 0.00
10026 NaN NaN Johnstone 516.00 France Male 35.00 10.00 57369.61 1.00 NaN NaN NaN NaN
10028 NaN NaN Liu 709.00 France Female 36.00 7.00 0.00 1.00 NaN NaN NaN NaN
10030 9999.00 NaN Sabbatini 772.00 Germany Male 42.00 3.00 75075.31 2.00 1.00 0.00 92888.52 1.00

1491 rows × 14 columns

In [67]:
churn.notnull()
Out[67]:
rownumber customerid surname creditscore geography gender age tenure balance numofproducts hascrcard isactivemember estimatedsalary exited
0 True True True True True True True True True True True True True True
1 True False True True True True True True True True True True True True
2 True True True True True True True True True True True True True True
3 True False True True True True True True True True True True True True
4 True False False False False False False False False False False False False False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
10025 True False True True True True True True True True True True True True
10026 False False True True True True True True True True False False False False
10028 False False True True True True True True True True False False False False
10030 True False True True True True True True True True True True True True
10031 True True True True True True True True True True True True True True

10000 rows × 14 columns

In [68]:
churn_catg.isnull().sum()
Out[68]:
surname      3
geography    3
gender       3
dtype: int64
In [69]:
churn_catg.isnull().sum().plot.bar()
Out[69]:
<AxesSubplot:>
In [70]:
churn_cont.isnull().sum()
Out[70]:
rownumber            52
customerid         1505
creditscore           3
age                   3
tenure                3
balance               3
numofproducts         3
hascrcard            53
isactivemember       53
estimatedsalary      53
exited               53
dtype: int64
In [71]:
churn_cont.isnull().sum().plot.bar()
Out[71]:
<AxesSubplot:>

lets calculate the amount of missing values in percentage of the entire dataset¶

In [72]:
(len(churn))
Out[72]:
10000
In [73]:
(len(churn))*100
Out[73]:
1000000
In [74]:
churn_percentage=churn.isnull().sum()/(len(churn))*100
In [75]:
churn_percentage
Out[75]:
rownumber          0.26
customerid        14.85
surname            0.02
creditscore        0.02
geography          0.02
gender             0.02
age                0.02
tenure             0.02
balance            0.02
numofproducts      0.02
hascrcard          0.27
isactivemember     0.27
estimatedsalary    0.27
exited             0.27
dtype: float64
In [76]:
(churn_percentage)*round(4)
Out[76]:
rownumber          1.04
customerid        59.40
surname            0.08
creditscore        0.08
geography          0.08
gender             0.08
age                0.08
tenure             0.08
balance            0.08
numofproducts      0.08
hascrcard          1.08
isactivemember     1.08
estimatedsalary    1.08
exited             1.08
dtype: float64

VISUALIZING MISSING VALUES¶

In [77]:
import missingno as msn
msn.matrix(churn)
Out[77]:
<AxesSubplot:>
In [78]:
msn.bar(churn)
Out[78]:
<AxesSubplot:>
In [79]:
msn.dendrogram(churn)
Out[79]:
<AxesSubplot:>
In [80]:
msn.heatmap(churn)
Out[80]:
<AxesSubplot:>
In [81]:
churn_nullvalue = pd.DataFrame((churn.isnull().sum())*100/churn.shape[0]).reset_index()
churn_nullvalue.columns = ['Column Name', 'Null Values Percentage']
fig = plt.figure(figsize=(18,6))
ax = sns.pointplot(x="Column Name",y="Null Values Percentage",data=churn_nullvalue,color='green')
plt.xticks(rotation =90,fontsize =8)
ax.axhline(40, ls='--',color='red')
plt.title("Percentage of Missing values in application data")
plt.ylabel("Null Values PERCENTAGE")
plt.xlabel("COLUMNS")
plt.show()
In [82]:
sns.heatmap(churn.isnull())
Out[82]:
<AxesSubplot:>
In [83]:
churn.isnull().sum()
Out[83]:
rownumber            26
customerid         1485
surname               2
creditscore           2
geography             2
gender                2
age                   2
tenure                2
balance               2
numofproducts         2
hascrcard            27
isactivemember       27
estimatedsalary      27
exited               27
dtype: int64
In [84]:
churn=churn.drop(labels=['customerid'],axis=1)
In [85]:
churn.isnull().sum()
Out[85]:
rownumber          26
surname             2
creditscore         2
geography           2
gender              2
age                 2
tenure              2
balance             2
numofproducts       2
hascrcard          27
isactivemember     27
estimatedsalary    27
exited             27
dtype: int64
In [86]:
churn=churn.drop(labels=['rownumber','hascrcard'],axis=1)
In [87]:
churn.isnull().sum()
Out[87]:
surname             2
creditscore         2
geography           2
gender              2
age                 2
tenure              2
balance             2
numofproducts       2
isactivemember     27
estimatedsalary    27
exited             27
dtype: int64
In [88]:
churn.columns
Out[88]:
Index(['surname', 'creditscore', 'geography', 'gender', 'age', 'tenure',
       'balance', 'numofproducts', 'isactivemember', 'estimatedsalary',
       'exited'],
      dtype='object')
In [89]:
len(churn.columns)
Out[89]:
11
In [90]:
churn_cont.isnull().sum().plot.bar()
Out[90]:
<AxesSubplot:>

INPUTATION¶

In [91]:
churn_cont
Out[91]:
rownumber customerid creditscore age tenure balance numofproducts hascrcard isactivemember estimatedsalary exited
0 1.00 15634602.00 619.00 42.00 2.00 0.00 1.00 1.00 1.00 101348.88 1.00
1 2.00 NaN 608.00 41.00 1.00 83807.86 1.00 0.00 1.00 112542.58 0.00
2 3.00 15619304.00 502.00 42.00 8.00 159660.80 3.00 1.00 0.00 113931.57 1.00
3 4.00 NaN 699.00 39.00 1.00 0.00 2.00 0.00 0.00 93826.63 0.00
4 5.00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ...
10027 NaN NaN 516.00 35.00 10.00 57369.61 1.00 NaN NaN NaN NaN
10028 NaN NaN 709.00 36.00 7.00 0.00 1.00 NaN NaN NaN NaN
10029 NaN NaN 709.00 36.00 7.00 0.00 1.00 NaN NaN NaN NaN
10030 9999.00 NaN 772.00 42.00 3.00 75075.31 2.00 1.00 0.00 92888.52 1.00
10031 10000.00 15628319.00 792.00 28.00 4.00 130142.79 1.00 1.00 0.00 38190.78 0.00

10032 rows × 11 columns

In [92]:
churn.isnull().sum()
Out[92]:
surname             2
creditscore         2
geography           2
gender              2
age                 2
tenure              2
balance             2
numofproducts       2
isactivemember     27
estimatedsalary    27
exited             27
dtype: int64
In [93]:
churn.isnull().sum().sum()
Out[93]:
97
In [94]:
mean_churn1=churn['estimatedsalary'].mean()
In [95]:
mean_churn1
Out[95]:
100093.52722149782
In [96]:
churn['estimatedsalary'].fillna(mean_churn1,inplace=True)
In [97]:
churn.isnull().sum()
Out[97]:
surname             2
creditscore         2
geography           2
gender              2
age                 2
tenure              2
balance             2
numofproducts       2
isactivemember     27
estimatedsalary     0
exited             27
dtype: int64
In [98]:
churn.isnull().sum().sum()
Out[98]:
70
In [99]:
churn['isactivemember'].unique()
Out[99]:
array([ 1.,  0., nan])
In [100]:
churn.isnull().sum()
Out[100]:
surname             2
creditscore         2
geography           2
gender              2
age                 2
tenure              2
balance             2
numofproducts       2
isactivemember     27
estimatedsalary     0
exited             27
dtype: int64
In [101]:
churn.isnull().sum().sum()
Out[101]:
70
In [102]:
churn_cont.head(2)
Out[102]:
rownumber customerid creditscore age tenure balance numofproducts hascrcard isactivemember estimatedsalary exited
0 1.00 15634602.00 619.00 42.00 2.00 0.00 1.00 1.00 1.00 101348.88 1.00
1 2.00 NaN 608.00 41.00 1.00 83807.86 1.00 0.00 1.00 112542.58 0.00
In [103]:
#drop all categorical or object or string missing values
churn.loc[:,['surname','geography',]].dropna()
Out[103]:
surname geography
0 Hargrave France
1 Hill Spain
2 Onio France
3 Boni France
5 Chu Spain
... ... ...
10025 Obijiaku France
10026 Johnstone France
10028 Liu France
10030 Sabbatini Germany
10031 Walker France

9998 rows × 2 columns

In [104]:
churn.isnull().sum()
Out[104]:
surname             2
creditscore         2
geography           2
gender              2
age                 2
tenure              2
balance             2
numofproducts       2
isactivemember     27
estimatedsalary     0
exited             27
dtype: int64
In [105]:
churn.gender.unique()
Out[105]:
array(['Female', nan, 'Male'], dtype=object)
In [106]:
len(churn.gender.unique())
Out[106]:
3
In [107]:
churn_mod=churn.gender.value_counts()
In [108]:
churn_mod
Out[108]:
Male      5456
Female    4542
Name: gender, dtype: int64
In [109]:
#mode is male
churn['gender'].fillna('Male',inplace=True)
In [110]:
churn['gender'].fillna('churn_mod',inplace=True)
In [111]:
churn.isnull().sum()
Out[111]:
surname             2
creditscore         2
geography           2
gender              0
age                 2
tenure              2
balance             2
numofproducts       2
isactivemember     27
estimatedsalary     0
exited             27
dtype: int64
In [112]:
churn.isnull().sum().sum()
Out[112]:
68
In [113]:
churn.dropna(inplace=True)
In [114]:
churn.isnull().sum()
Out[114]:
surname            0
creditscore        0
geography          0
gender             0
age                0
tenure             0
balance            0
numofproducts      0
isactivemember     0
estimatedsalary    0
exited             0
dtype: int64
In [115]:
churn.isnull().sum().sum()
Out[115]:
0

VISUALIZING THE DATA AFTER TAKING CARE OF THE MISSING VALUES¶

In [116]:
#check new visuals
msn.matrix(churn)
Out[116]:
<AxesSubplot:>
In [117]:
msn.bar(churn)
Out[117]:
<AxesSubplot:>
In [118]:
msn.dendrogram(churn)
Out[118]:
<AxesSubplot:>
In [119]:
msn.heatmap(churn)
Out[119]:
<AxesSubplot:>
In [120]:
churn_nullvalue = pd.DataFrame((churn.isnull().sum())*100/churn.shape[0]).reset_index()
churn_nullvalue.columns = ['Column Name', 'Null Values Percentage']
fig = plt.figure(figsize=(18,6))
ax = sns.pointplot(x="Column Name",y="Null Values Percentage",data=churn_nullvalue,color='green')
plt.xticks(rotation =90,fontsize =8)
ax.axhline(40, ls='--',color='red')
plt.title("Percentage of Missing values in application data")
plt.ylabel("Null Values PERCENTAGE")
plt.xlabel("COLUMNS")
plt.show()
In [121]:
sns.heatmap(churn.isnull())
Out[121]:
<AxesSubplot:>

OUTLIERS¶

In [122]:
churn.head(2)
Out[122]:
surname creditscore geography gender age tenure balance numofproducts isactivemember estimatedsalary exited
0 Hargrave 619.00 France Female 42.00 2.00 0.00 1.00 1.00 101348.88 1.00
1 Hill 608.00 Spain Female 41.00 1.00 83807.86 1.00 1.00 112542.58 0.00
In [123]:
churn.describe()
Out[123]:
creditscore age tenure balance numofproducts isactivemember estimatedsalary exited
count 9973.00 9973.00 9973.00 9973.00 9973.00 9973.00 9973.00 9973.00
mean 650.49 38.93 5.01 76498.54 1.53 0.52 100093.53 0.20
std 96.63 10.50 2.89 62383.80 0.58 0.50 57494.40 0.40
min 350.00 18.00 0.00 0.00 1.00 0.00 11.58 0.00
25% 584.00 32.00 3.00 0.00 1.00 0.00 51099.56 0.00
50% 652.00 37.00 5.00 97234.58 1.00 1.00 100187.43 0.00
75% 717.00 44.00 7.00 127642.44 2.00 1.00 149381.32 0.00
max 850.00 92.00 10.00 250898.09 4.00 1.00 199992.48 1.00
In [124]:
sns.distplot(churn['balance'],kde=False,hist=True,bins=12)
plt.title('balance distribution',size=16)
plt.ylabel('count')
Out[124]:
Text(0, 0.5, 'count')
In [125]:
churn.boxplot()
Out[125]:
<AxesSubplot:>
In [126]:
sns.distplot(churn['balance'])
Out[126]:
<AxesSubplot:xlabel='balance', ylabel='Density'>
In [127]:
churn['estimatedsalary'].plot(kind='box',figsize=(3,4),patch_artist=True)
Out[127]:
<AxesSubplot:>
In [128]:
sns.boxplot(x='gender',y='age',data=churn)
Out[128]:
<AxesSubplot:xlabel='gender', ylabel='age'>

DETECT OUTLIERS¶

Z-SCORE METHOD¶

In [129]:
churn['balance'].plot(kind='box',figsize=(3,4),patch_artist=True)
Out[129]:
<AxesSubplot:>
In [130]:
churn['age'].unique()
Out[130]:
array([42., 41., 39., 44., 29., 27., 31., 24., 34., 25., 35., 45., 58.,
       32., 38., 46., 36., 43., 33., 40., 51., 61., 49., 37., 19., 66.,
       56., 26., 21., 55., 75., 22., 30., 28., 65., 48., 52., 50., 57.,
       73., 47., 54., 72., 20., 67., 79., 62., 53., 80., 59., 68., 23.,
       60., 70., 63., 64., 18., 82., 69., 74., 71., 76., 77., 88., 85.,
       84., 78., 81., 92., 83.])
In [131]:
sns.boxplot(y='age',data=churn)
Out[131]:
<AxesSubplot:ylabel='age'>
In [132]:
churn['age'].mean()
Out[132]:
38.9250977639627
In [133]:
churn1=churn[churn['age']<=70]
In [134]:
sns.boxplot(y='age',data=churn1)
Out[134]:
<AxesSubplot:ylabel='age'>
In [135]:
churn2=churn[churn['age']<=58]
In [136]:
sns.boxplot(y='age',data=churn2)
Out[136]:
<AxesSubplot:ylabel='age'>
In [ ]:
 
In [137]:
sns.boxplot(y='balance',data=churn)
Out[137]:
<AxesSubplot:ylabel='balance'>
In [138]:
churn2=churn
In [139]:
churn['balance'].mean()
Out[139]:
76498.54009425409
In [140]:
churn.shape
Out[140]:
(9973, 11)
In [141]:
churn.boxplot()
Out[141]:
<AxesSubplot:>
In [142]:
churn.hist()
Out[142]:
array([[<AxesSubplot:title={'center':'creditscore'}>,
        <AxesSubplot:title={'center':'age'}>,
        <AxesSubplot:title={'center':'tenure'}>],
       [<AxesSubplot:title={'center':'balance'}>,
        <AxesSubplot:title={'center':'numofproducts'}>,
        <AxesSubplot:title={'center':'isactivemember'}>],
       [<AxesSubplot:title={'center':'estimatedsalary'}>,
        <AxesSubplot:title={'center':'exited'}>, <AxesSubplot:>]],
      dtype=object)

DETECT AND REMOVE OUTLIERS¶

Data points, that falls outside of 1.5 times of the interquartile range above the 3rd or below the 1st quartile¶

is an outlier.¶

In [143]:
churn.describe()
Out[143]:
creditscore age tenure balance numofproducts isactivemember estimatedsalary exited
count 9973.00 9973.00 9973.00 9973.00 9973.00 9973.00 9973.00 9973.00
mean 650.49 38.93 5.01 76498.54 1.53 0.52 100093.53 0.20
std 96.63 10.50 2.89 62383.80 0.58 0.50 57494.40 0.40
min 350.00 18.00 0.00 0.00 1.00 0.00 11.58 0.00
25% 584.00 32.00 3.00 0.00 1.00 0.00 51099.56 0.00
50% 652.00 37.00 5.00 97234.58 1.00 1.00 100187.43 0.00
75% 717.00 44.00 7.00 127642.44 2.00 1.00 149381.32 0.00
max 850.00 92.00 10.00 250898.09 4.00 1.00 199992.48 1.00
In [144]:
churn.shape
Out[144]:
(9973, 11)

BOXPLOT¶

In [145]:
#DEFINE A FUNCTION CALLED plot_box 
#and there will be two arguments call it df and ft
#and plot the boxplot with the boxplot function in pandas
def plot_boxplot(churn1,ft):
    churn.boxplot(column=[ft])
    plt.grid(False)
    plt.show()
In [146]:
#lets call it
plot_boxplot(churn,'creditscore')
In [147]:
sns.boxplot(y='age',data=churn1)
Out[147]:
<AxesSubplot:ylabel='age'>
In [148]:
sns.boxplot(y='creditscore',data=churn1)
Out[148]:
<AxesSubplot:ylabel='creditscore'>
In [149]:
plot_boxplot(churn,'tenure')
In [150]:
plot_boxplot(churn,'balance')
In [151]:
plot_boxplot(churn,'numofproducts')
In [152]:
churn1=churn[churn['numofproducts']<=3.5]
In [153]:
plot_boxplot(churn1,'numofproducts')
In [154]:
plot_boxplot(churn,'estimatedsalary')

REMOVE OUTLIERS¶

In [155]:
#FIRST WE WILL EXTRAT ALL OUTLIERS IN ALL COLUMN 
# SECONDLY WE USE THE INDEXES OF THE OUTLIERS TO REMOVE THEM FROM THE DATAFRAME  BY CREATING A LIST 
#DEFINE A FUNCTION CALLED OULIERS
# IQR =Q3 -  Q1 
# +/- 1.5 * IQR
# SET THE UPPER AND LOWER BOUNDRIES
In [156]:
Q1=churn.creditscore.quantile(0.25)
Q3=churn.creditscore.quantile(0.75)
In [157]:
Q1,Q3
Out[157]:
(584.0, 717.0)
In [158]:
plt.hist(churn['creditscore'],)
plt.title('distribution of total creditscore')
plt.show()
In [159]:
sns.distplot(churn['creditscore'])
Out[159]:
<AxesSubplot:xlabel='creditscore', ylabel='Density'>
In [160]:
IQR=Q3-Q1 
In [161]:
IQR
Out[161]:
133.0
In [162]:
lower_limit = Q1-(1.5*IQR)
In [163]:
lower_limit
Out[163]:
384.5
In [164]:
upper_limit=Q3+(1.5*IQR)
In [165]:
upper_limit
Out[165]:
916.5
In [166]:
lower_limit, upper_limit
Out[166]:
(384.5, 916.5)
In [167]:
churn[(churn.creditscore<lower_limit)|(churn.creditscore>upper_limit)]
Out[167]:
surname creditscore geography gender age tenure balance numofproducts isactivemember estimatedsalary exited
8 Obinna 376.00 Germany Female 29.00 4.00 115046.74 4.00 0.00 119346.88 1.00
949 Lin 376.00 France Female 46.00 6.00 0.00 1.00 0.00 157333.69 1.00
1200 Thomas 363.00 Spain Female 28.00 6.00 146098.43 3.00 0.00 100615.14 1.00
1412 Panicucci 359.00 France Female 44.00 6.00 128747.69 1.00 0.00 146955.71 1.00
1638 Azubuike 350.00 Spain Male 54.00 1.00 152677.48 1.00 1.00 191973.49 1.00
1845 Campbell 350.00 Germany Male 39.00 0.00 109733.20 2.00 0.00 123602.11 1.00
1969 Aikenhead 358.00 Spain Female 52.00 8.00 143542.36 3.00 0.00 141959.11 1.00
2480 Chou 351.00 Germany Female 57.00 4.00 163146.46 1.00 0.00 169621.69 1.00
2586 Ozoemena 365.00 Germany Male 30.00 0.00 127760.07 1.00 0.00 81537.85 1.00
5511 Ijendu 383.00 Spain Female 48.00 8.00 95808.19 1.00 0.00 137702.01 1.00
8171 Ch'ien 367.00 Spain Male 42.00 6.00 93608.28 1.00 0.00 168816.73 1.00
8740 Onyekachi 350.00 France Male 51.00 10.00 0.00 1.00 1.00 125823.79 1.00
8779 Lin 350.00 France Female 60.00 3.00 0.00 1.00 0.00 113796.15 1.00
9227 Watts 382.00 Spain Male 36.00 0.00 0.00 1.00 1.00 179540.73 1.00
9373 Loggia 373.00 France Male 42.00 7.00 0.00 1.00 0.00 77786.37 1.00
9641 Maslow 350.00 France Female 40.00 0.00 111098.85 1.00 1.00 172321.21 1.00
In [168]:
(churn['creditscore']<916.5)
Out[168]:
0        True
1        True
2        True
3        True
5        True
         ... 
9997     True
10006    True
10025    True
10030    True
10031    True
Name: creditscore, Length: 9973, dtype: bool
In [169]:
(churn['creditscore']>384.5)
Out[169]:
0        True
1        True
2        True
3        True
5        True
         ... 
9997     True
10006    True
10025    True
10030    True
10031    True
Name: creditscore, Length: 9973, dtype: bool
In [170]:
churn2=churn[(churn['creditscore']<916.5)&(churn['creditscore']>384.5)]
In [171]:
churn2['creditscore'].plot(kind='box', figsize=(3,4), patch_artist=True)
Out[171]:
<AxesSubplot:>
In [172]:
churn2.shape
Out[172]:
(9957, 11)
In [173]:
plt.hist(churn2['creditscore'])
plt.title('distribution of total creditscore')
plt.show()
In [174]:
sns.distplot(churn2['creditscore'])
Out[174]:
<AxesSubplot:xlabel='creditscore', ylabel='Density'>
In [175]:
churn2=churn
In [176]:
#cross check by visualizing the box plot
plot_boxplot(churn,'creditscore')
In [177]:
plot_boxplot(churn,'age')
In [178]:
#SAVE THE NEW DATA FRAME IN PANDAS
churn.head(2)
Out[178]:
surname creditscore geography gender age tenure balance numofproducts isactivemember estimatedsalary exited
0 Hargrave 619.00 France Female 42.00 2.00 0.00 1.00 1.00 101348.88 1.00
1 Hill 608.00 Spain Female 41.00 1.00 83807.86 1.00 1.00 112542.58 0.00
In [179]:
churn.shape
Out[179]:
(9973, 11)
In [180]:
churn.describe()
Out[180]:
creditscore age tenure balance numofproducts isactivemember estimatedsalary exited
count 9973.00 9973.00 9973.00 9973.00 9973.00 9973.00 9973.00 9973.00
mean 650.49 38.93 5.01 76498.54 1.53 0.52 100093.53 0.20
std 96.63 10.50 2.89 62383.80 0.58 0.50 57494.40 0.40
min 350.00 18.00 0.00 0.00 1.00 0.00 11.58 0.00
25% 584.00 32.00 3.00 0.00 1.00 0.00 51099.56 0.00
50% 652.00 37.00 5.00 97234.58 1.00 1.00 100187.43 0.00
75% 717.00 44.00 7.00 127642.44 2.00 1.00 149381.32 0.00
max 850.00 92.00 10.00 250898.09 4.00 1.00 199992.48 1.00
In [181]:
churn.boxplot()
Out[181]:
<AxesSubplot:>
In [182]:
sns.pairplot(churn)
Out[182]:
<seaborn.axisgrid.PairGrid at 0x1b564768b20>

EXPLORATORY DATA ANALYSIS¶

In [183]:
churn.corr()
Out[183]:
creditscore age tenure balance numofproducts isactivemember estimatedsalary exited
creditscore 1.00 -0.00 0.00 0.01 0.01 0.03 -0.00 -0.03
age -0.00 1.00 -0.01 0.03 -0.03 0.09 -0.01 0.29
tenure 0.00 -0.01 1.00 -0.01 0.01 -0.03 0.01 -0.01
balance 0.01 0.03 -0.01 1.00 -0.30 -0.01 0.01 0.12
numofproducts 0.01 -0.03 0.01 -0.30 1.00 0.01 0.01 -0.05
isactivemember 0.03 0.09 -0.03 -0.01 0.01 1.00 -0.01 -0.16
estimatedsalary -0.00 -0.01 0.01 0.01 0.01 -0.01 1.00 0.01
exited -0.03 0.29 -0.01 0.12 -0.05 -0.16 0.01 1.00
In [184]:
plt.figure(figsize=(20,20))
sns.heatmap(churn.corr(), vmin=-1, cmap="plasma_r", annot=True)
#same thing can be seen from the correlation as well
Out[184]:
<AxesSubplot:>
In [185]:
churn.head(2)
Out[185]:
surname creditscore geography gender age tenure balance numofproducts isactivemember estimatedsalary exited
0 Hargrave 619.00 France Female 42.00 2.00 0.00 1.00 1.00 101348.88 1.00
1 Hill 608.00 Spain Female 41.00 1.00 83807.86 1.00 1.00 112542.58 0.00

form the corrolation heatmap it shows exited which is the target variable is highly corrolated with the following independent variable, estimated salary , balance and age.¶

In [ ]:
 

EXPLORATORY DATA ANALYSIS¶

QUERRYING THE DATASET¶

Q1,WHAT IS THE MEAN, MIN , AND MAX VALUE OF estimatedsalary¶
GROUPBY¶
In [186]:
churn.groupby('estimatedsalary').mean()
Out[186]:
creditscore age tenure balance numofproducts isactivemember exited
estimatedsalary
11.58 709.00 45.00 4.00 122917.71 1.00 1.00 1.00
90.07 625.00 31.00 5.00 0.00 2.00 1.00 0.00
91.75 645.00 59.00 8.00 121669.93 2.00 0.00 1.00
96.27 710.00 38.00 2.00 0.00 2.00 0.00 0.00
106.67 629.00 40.00 9.00 0.00 1.00 0.00 0.00
... ... ... ... ... ... ... ...
199909.32 620.00 28.00 8.00 0.00 2.00 1.00 0.00
199929.17 569.00 37.00 9.00 178755.84 1.00 0.00 0.00
199953.33 608.00 27.00 4.00 153325.10 1.00 1.00 0.00
199970.74 639.00 41.00 5.00 98635.77 1.00 0.00 0.00
199992.48 685.00 42.00 2.00 0.00 2.00 0.00 0.00

9970 rows × 7 columns

In [ ]:
 
In [187]:
churn.groupby('creditscore').min()
Out[187]:
surname geography gender age tenure balance numofproducts isactivemember estimatedsalary exited
creditscore
350.00 Azubuike France Female 39.00 0.00 0.00 1.00 0.00 113796.15 1.00
351.00 Chou Germany Female 57.00 4.00 163146.46 1.00 0.00 169621.69 1.00
358.00 Aikenhead Spain Female 52.00 8.00 143542.36 3.00 0.00 141959.11 1.00
359.00 Panicucci France Female 44.00 6.00 128747.69 1.00 0.00 146955.71 1.00
363.00 Thomas Spain Female 28.00 6.00 146098.43 3.00 0.00 100615.14 1.00
... ... ... ... ... ... ... ... ... ... ...
846.00 Baxter France Female 34.00 0.00 0.00 1.00 0.00 3440.47 0.00
847.00 Beam France Female 31.00 3.00 0.00 1.00 0.00 16025.17 0.00
848.00 Hung Germany Female 22.00 3.00 90018.45 1.00 0.00 30876.84 0.00
849.00 Chou France Female 27.00 0.00 0.00 1.00 0.00 17294.12 0.00
850.00 Abbott France Female 19.00 0.00 0.00 1.00 0.00 705.18 0.00

460 rows × 10 columns

Q2, FIND ALL INSTANCES WHEN CREDITSCORE IS GREATER THAN OR EQUALL TO 600 OR GEOGRAPHY IS FRANCE¶

In [ ]:
 
In [188]:
churn[(churn['creditscore']>=600)|(churn['geography']=='france')]
Out[188]:
surname creditscore geography gender age tenure balance numofproducts isactivemember estimatedsalary exited
0 Hargrave 619.00 France Female 42.00 2.00 0.00 1.00 1.00 101348.88 1.00
1 Hill 608.00 Spain Female 41.00 1.00 83807.86 1.00 1.00 112542.58 0.00
3 Boni 699.00 France Female 39.00 1.00 0.00 2.00 0.00 93826.63 0.00
5 Chu 645.00 Spain Male 44.00 8.00 113755.78 2.00 0.00 149756.71 1.00
10 H? 684.00 France Male 27.00 2.00 134603.88 1.00 1.00 71725.73 0.00
... ... ... ... ... ... ... ... ... ... ... ...
9997 T'ao 741.00 Spain Male 35.00 6.00 74371.49 1.00 0.00 99595.67 0.00
10006 Nepean 659.00 France Male 36.00 6.00 123841.49 2.00 0.00 96833.00 0.00
10025 Obijiaku 771.00 France Male 39.00 5.00 0.00 2.00 0.00 96270.64 0.00
10030 Sabbatini 772.00 Germany Male 42.00 3.00 75075.31 2.00 0.00 92888.52 1.00
10031 Walker 792.00 France Female 28.00 4.00 130142.79 1.00 0.00 38190.78 0.00

6947 rows × 11 columns

In [ ]:
 
In [189]:
list[churn[(churn['creditscore']<600)|(churn['geography']=='france')]]
Out[189]:
list[       surname  creditscore geography  gender   age  tenure   balance  \
2         Onio       502.00    France  Female 42.00    8.00 159660.80   
8       Obinna       376.00   Germany  Female 29.00    4.00 115046.74   
9           He       501.00    France    Male 44.00    4.00 142051.07   
11      Bearce       528.00    France    Male 31.00    6.00 102016.72   
12     Andrews       497.00     Spain    Male 24.00    3.00      0.00   
...        ...          ...       ...     ...   ...     ...       ...   
9981   Douglas       479.00    France    Male 34.00    5.00 117593.48   
9983     Ch'en       479.00     Spain    Male 35.00    4.00 125920.98   
9985  McGregor       512.00   Germany  Female 40.00    8.00 153537.57   
9987  Thompson       518.00    France    Male 42.00    7.00 151027.05   
9991    Barker       583.00    France    Male 33.00    7.00 122531.86   

      numofproducts  isactivemember  estimatedsalary  exited  
2              3.00            0.00        113931.57    1.00  
8              4.00            0.00        119346.88    1.00  
9              2.00            1.00         74940.50    0.00  
11             2.00            0.00         80181.12    0.00  
12             2.00            0.00         76390.01    0.00  
...             ...             ...              ...     ...  
9981           2.00            0.00        113308.29    0.00  
9983           1.00            1.00         20393.44    0.00  
9985           2.00            0.00         23101.13    0.00  
9987           2.00            0.00        119377.36    0.00  
9991           1.00            0.00         13549.24    0.00  

[3026 rows x 11 columns]]

Q3 SHOW ALL RECORDS WHERE CREDITSTORE IS 502.0 OR 518.0¶

IS IN¶
In [190]:
churn['creditscore'].isin(['502.0','518.0'])
Out[190]:
0        False
1        False
2        False
3        False
5        False
         ...  
9997     False
10006    False
10025    False
10030    False
10031    False
Name: creditscore, Length: 9973, dtype: bool

QUESTION 4 FIND TOTAL NUMBER OF RECORDS HAVING CREDITSCORE OF 502 TO 518 INCLUSIVE USING BETWEEN METHOD¶

In [191]:
[churn['creditscore']>=502]
Out[191]:
[0        True
 1        True
 2        True
 3        True
 5        True
          ... 
 9997     True
 10006    True
 10025    True
 10030    True
 10031    True
 Name: creditscore, Length: 9973, dtype: bool]
In [192]:
[churn['creditscore']<=518]
Out[192]:
[0        False
 1        False
 2         True
 3        False
 5        False
          ...  
 9997     False
 10006    False
 10025    False
 10030    False
 10031    False
 Name: creditscore, Length: 9973, dtype: bool]
In [193]:
sum(churn['creditscore'].between(502,518))
Out[193]:
271
In [194]:
churn.head(2)
Out[194]:
surname creditscore geography gender age tenure balance numofproducts isactivemember estimatedsalary exited
0 Hargrave 619.00 France Female 42.00 2.00 0.00 1.00 1.00 101348.88 1.00
1 Hill 608.00 Spain Female 41.00 1.00 83807.86 1.00 1.00 112542.58 0.00

geography¶

Q5 how many people have spain as thier geographical location¶

In [195]:
churn['geography']
Out[195]:
0         France
1          Spain
2         France
3         France
5          Spain
          ...   
9997       Spain
10006     France
10025     France
10030    Germany
10031     France
Name: geography, Length: 9973, dtype: object
In [196]:
churn['geography'].str.contains('spain',case=False)
Out[196]:
0        False
1         True
2        False
3        False
5         True
         ...  
9997      True
10006    False
10025    False
10030    False
10031    False
Name: geography, Length: 9973, dtype: bool
In [197]:
churn[churn['geography'].str.contains('spain',case=False)]
Out[197]:
surname creditscore geography gender age tenure balance numofproducts isactivemember estimatedsalary exited
1 Hill 608.00 Spain Female 41.00 1.00 83807.86 1.00 1.00 112542.58 0.00
5 Chu 645.00 Spain Male 44.00 8.00 113755.78 2.00 0.00 149756.71 1.00
12 Andrews 497.00 Spain Male 24.00 3.00 0.00 2.00 0.00 76390.01 0.00
18 Scott 635.00 Spain Female 35.00 7.00 0.00 2.00 1.00 65951.65 0.00
24 Henderson 549.00 Spain Female 24.00 9.00 0.00 2.00 1.00 14406.41 0.00
... ... ... ... ... ... ... ... ... ... ... ...
9976 Graham 764.00 Spain Male 38.00 4.00 113607.47 1.00 0.00 91094.46 0.00
9978 Aldridge 573.00 Spain Male 44.00 9.00 0.00 2.00 0.00 107124.17 0.00
9979 Flynn 702.00 Spain Male 44.00 9.00 0.00 1.00 0.00 59207.41 1.00
9983 Ch'en 479.00 Spain Male 35.00 4.00 125920.98 1.00 1.00 20393.44 0.00
9997 T'ao 741.00 Spain Male 35.00 6.00 74371.49 1.00 0.00 99595.67 0.00

2471 rows × 11 columns

In [198]:
len(churn['geography'].str.contains('spain',case=False))
Out[198]:
9973

Q6 TOTAL NUMBER OF RECORDS IN GEOGRAPHY¶

In [199]:
churn['geography'].nunique()
Out[199]:
3
In [200]:
churn.head(2)
Out[200]:
surname creditscore geography gender age tenure balance numofproducts isactivemember estimatedsalary exited
0 Hargrave 619.00 France Female 42.00 2.00 0.00 1.00 1.00 101348.88 1.00
1 Hill 608.00 Spain Female 41.00 1.00 83807.86 1.00 1.00 112542.58 0.00

Q7 WHICH GEOGRAPHY HAS THE HIGEST CREDITSCORE AND WHICH GEOGRAPHY HAVE THE HIGHEST AVERAGE RATING¶

In [201]:
churn.groupby('geography')['creditscore'].max()
Out[201]:
geography
France    850.00
Germany   850.00
Spain     850.00
Name: creditscore, dtype: float64
In [202]:
churn.groupby('geography')['creditscore'].mean()
Out[202]:
geography
France    649.61
Germany   651.55
Spain     651.19
Name: creditscore, dtype: float64
In [203]:
churn.groupby('geography')['creditscore'].mean().sort_values(ascending=False)
Out[203]:
geography
Germany   651.55
Spain     651.19
France    649.61
Name: creditscore, dtype: float64
In [204]:
churn.head(2)
Out[204]:
surname creditscore geography gender age tenure balance numofproducts isactivemember estimatedsalary exited
0 Hargrave 619.00 France Female 42.00 2.00 0.00 1.00 1.00 101348.88 1.00
1 Hill 608.00 Spain Female 41.00 1.00 83807.86 1.00 1.00 112542.58 0.00

Q8 TOTAL NUMBER OF GEOGRAPHY HAVING 2 TENURE¶

In [205]:
churn.columns
Out[205]:
Index(['surname', 'creditscore', 'geography', 'gender', 'age', 'tenure',
       'balance', 'numofproducts', 'isactivemember', 'estimatedsalary',
       'exited'],
      dtype='object')
In [206]:
churn['tenure']
Out[206]:
0       2.00
1       1.00
2       8.00
3       1.00
5       8.00
        ... 
9997    6.00
10006   6.00
10025   5.00
10030   3.00
10031   4.00
Name: tenure, Length: 9973, dtype: float64
In [207]:
churn['tenure']==2
Out[207]:
0         True
1        False
2        False
3        False
5        False
         ...  
9997     False
10006    False
10025    False
10030    False
10031    False
Name: tenure, Length: 9973, dtype: bool
In [208]:
churn[churn['tenure']==2]
Out[208]:
surname creditscore geography gender age tenure balance numofproducts isactivemember estimatedsalary exited
0 Hargrave 619.00 France Female 42.00 2.00 0.00 1.00 1.00 101348.88 1.00
10 H? 684.00 France Male 27.00 2.00 134603.88 1.00 1.00 71725.73 0.00
33 Young 756.00 Germany Male 36.00 2.00 136815.64 1.00 1.00 170041.95 0.00
49 Osborne 556.00 France Female 61.00 2.00 117419.35 1.00 1.00 94153.83 0.00
50 Lavine 834.00 France Female 49.00 2.00 131394.56 1.00 0.00 194365.76 1.00
... ... ... ... ... ... ... ... ... ... ... ...
9922 Emenike 645.00 Germany Male 41.00 2.00 93925.30 1.00 0.00 123982.14 1.00
9926 Endrizzi 739.00 France Male 58.00 2.00 101579.28 1.00 1.00 72168.53 0.00
9960 Fu 784.00 Spain Male 23.00 2.00 0.00 1.00 1.00 6847.73 0.00
9970 Burke 550.00 France Male 47.00 2.00 0.00 2.00 1.00 97057.28 0.00
9980 Scott 740.00 Germany Male 33.00 2.00 126524.11 1.00 0.00 136869.31 0.00

1045 rows × 11 columns

In [209]:
len(churn[churn['tenure']==2])
Out[209]:
1045

GENDER¶

Q9 FIND TOTAL NUMBER OF MALE AND FEMALE¶

In [210]:
churn.head(2)
Out[210]:
surname creditscore geography gender age tenure balance numofproducts isactivemember estimatedsalary exited
0 Hargrave 619.00 France Female 42.00 2.00 0.00 1.00 1.00 101348.88 1.00
1 Hill 608.00 Spain Female 41.00 1.00 83807.86 1.00 1.00 112542.58 0.00
In [211]:
churn['gender'].value_counts()
Out[211]:
Male      5438
Female    4535
Name: gender, dtype: int64

Q10 WHICH GENDER HAVE THE HIGEST BALANCE¶

In [212]:
churn['balance'].max()
Out[212]:
250898.09
In [213]:
churn[churn['balance'].max()==churn['balance']]['gender']
Out[213]:
2099    Male
Name: gender, dtype: object

QUESTION11, DISPLAY TOP 5 NAMES HAVING MAXIMUM NUMBER OF SALARY¶

In [214]:
churn.head(2)
Out[214]:
surname creditscore geography gender age tenure balance numofproducts isactivemember estimatedsalary exited
0 Hargrave 619.00 France Female 42.00 2.00 0.00 1.00 1.00 101348.88 1.00
1 Hill 608.00 Spain Female 41.00 1.00 83807.86 1.00 1.00 112542.58 0.00
In [215]:
churn_index=churn['estimatedsalary'].sort_values(ascending=False).head().index
In [216]:
churn.iloc[churn_index]
Out[216]:
surname creditscore geography gender age tenure balance numofproducts isactivemember estimatedsalary exited
6692 Christmas 654.00 Spain Female 36.00 5.00 0.00 2.00 0.00 157238.05 0.00
3596 Watts 446.00 France Female 51.00 4.00 105056.13 1.00 0.00 70613.52 0.00
4433 Chao 707.00 France Female 44.00 6.00 0.00 2.00 1.00 192542.17 0.00
9142 Iheanacho 749.00 Spain Male 34.00 2.00 0.00 1.00 0.00 174189.04 1.00
8972 Afanasyeva 518.00 Spain Male 50.00 4.00 0.00 1.00 0.00 107112.25 1.00

QUESTION12 WHAT IS THE RECORDS OF THE SURNAME WATTS¶

In [217]:
churn.head(2)
Out[217]:
surname creditscore geography gender age tenure balance numofproducts isactivemember estimatedsalary exited
0 Hargrave 619.00 France Female 42.00 2.00 0.00 1.00 1.00 101348.88 1.00
1 Hill 608.00 Spain Female 41.00 1.00 83807.86 1.00 1.00 112542.58 0.00
In [218]:
churn['surname']
Out[218]:
0         Hargrave
1             Hill
2             Onio
3             Boni
5              Chu
           ...    
9997          T'ao
10006       Nepean
10025     Obijiaku
10030    Sabbatini
10031       Walker
Name: surname, Length: 9973, dtype: object
In [219]:
churn['surname'].isin(['Watts'])
Out[219]:
0        False
1        False
2        False
3        False
5        False
         ...  
9997     False
10006    False
10025    False
10030    False
10031    False
Name: surname, Length: 9973, dtype: bool
In [220]:
churn[churn['surname'].isin(['Watts'])]
Out[220]:
surname creditscore geography gender age tenure balance numofproducts isactivemember estimatedsalary exited
529 Watts 527.00 Spain Male 37.00 5.00 93722.73 2.00 1.00 139093.73 0.00
3596 Watts 446.00 France Female 51.00 4.00 105056.13 1.00 0.00 70613.52 0.00
9227 Watts 382.00 Spain Male 36.00 0.00 0.00 1.00 1.00 179540.73 1.00
9635 Watts 581.00 France Female 24.00 3.00 95508.20 1.00 1.00 45755.00 0.00
9826 Watts 820.00 France Female 39.00 1.00 104614.29 1.00 0.00 61538.43 1.00
In [221]:
# WE CAN ALSO USE STR CONTAINS
churn[churn['surname'].str.contains('Watts')]
Out[221]:
surname creditscore geography gender age tenure balance numofproducts isactivemember estimatedsalary exited
529 Watts 527.00 Spain Male 37.00 5.00 93722.73 2.00 1.00 139093.73 0.00
3596 Watts 446.00 France Female 51.00 4.00 105056.13 1.00 0.00 70613.52 0.00
9227 Watts 382.00 Spain Male 36.00 0.00 0.00 1.00 1.00 179540.73 1.00
9635 Watts 581.00 France Female 24.00 3.00 95508.20 1.00 1.00 45755.00 0.00
9826 Watts 820.00 France Female 39.00 1.00 104614.29 1.00 0.00 61538.43 1.00

AGE¶

Q13 DISPLAY TOP 5 OLDEST GENDER¶

In [222]:
churn['age'].unique
Out[222]:
<bound method Series.unique of 0       42.00
1       41.00
2       42.00
3       39.00
5       44.00
         ... 
9997    35.00
10006   36.00
10025   39.00
10030   42.00
10031   28.00
Name: age, Length: 9973, dtype: float64>
In [223]:
churn['age'].sort_values(ascending=False)
Out[223]:
6776   92.00
6460   92.00
2465   88.00
3040   85.00
3538   84.00
        ... 
9949   18.00
2143   18.00
2148   18.00
4563   18.00
2028   18.00
Name: age, Length: 9973, dtype: float64
In [224]:
churn['age'].sort_values(ascending=False).head()
Out[224]:
6776   92.00
6460   92.00
2465   88.00
3040   85.00
3538   84.00
Name: age, dtype: float64
In [225]:
index=churn['age'].sort_values(ascending=False).head().index
In [226]:
churn.iloc[index]
Out[226]:
surname creditscore geography gender age tenure balance numofproducts isactivemember estimatedsalary exited
6805 Costa 825.00 Germany Male 37.00 6.00 118050.79 1.00 1.00 52301.15 0.00
6489 Johnston 651.00 Germany Female 34.00 10.00 148962.46 1.00 0.00 66389.43 1.00
2474 Yuryeva 617.00 France Female 27.00 4.00 0.00 2.00 0.00 190269.21 0.00
3049 Mahon 835.00 Germany Female 29.00 10.00 130420.20 2.00 0.00 106276.55 0.00
3547 Crawford 672.00 France Male 43.00 5.00 0.00 1.00 0.00 63833.09 0.00

Q14 GIVE ALL THE NAMES THAT HAVE AGE 50¶

In [227]:
churn['age']
Out[227]:
0       42.00
1       41.00
2       42.00
3       39.00
5       44.00
         ... 
9997    35.00
10006   36.00
10025   39.00
10030   42.00
10031   28.00
Name: age, Length: 9973, dtype: float64
In [228]:
churn[churn['age']==50.0]
Out[228]:
surname creditscore geography gender age tenure balance numofproducts isactivemember estimatedsalary exited
142 Arthur 589.00 Germany Female 50.00 5.00 144895.05 2.00 1.00 34941.23 0.00
191 Cocci 804.00 France Female 50.00 4.00 0.00 1.00 1.00 8546.87 1.00
209 He 516.00 Spain Male 50.00 5.00 0.00 1.00 1.00 146145.93 1.00
235 O'Sullivan 751.00 Germany Male 50.00 2.00 96888.39 1.00 0.00 77206.25 1.00
315 Balashov 614.00 France Male 50.00 4.00 137104.47 1.00 0.00 127166.49 1.00
... ... ... ... ... ... ... ... ... ... ... ...
9362 Miah 582.00 France Male 50.00 2.00 148942.00 1.00 1.00 116944.30 0.00
9655 Baldwin 744.00 Germany Female 50.00 1.00 121498.11 2.00 1.00 106061.47 1.00
9899 Okeke 771.00 France Male 50.00 3.00 105229.72 1.00 1.00 16281.68 1.00
9941 Palerma 632.00 Germany Female 50.00 5.00 107959.39 1.00 1.00 6985.34 1.00
9992 Smith 610.00 Germany Male 50.00 1.00 113957.01 2.00 0.00 196526.55 1.00

133 rows × 11 columns

In [229]:
len(churn[churn['age']==50.0])
Out[229]:
133

Q15 SHOW ALL THE RECORDS THAT HAVE BALANCE ABOVE 83807.86¶

BALANCE¶

In [230]:
churn.head(2)
Out[230]:
surname creditscore geography gender age tenure balance numofproducts isactivemember estimatedsalary exited
0 Hargrave 619.00 France Female 42.00 2.00 0.00 1.00 1.00 101348.88 1.00
1 Hill 608.00 Spain Female 41.00 1.00 83807.86 1.00 1.00 112542.58 0.00
In [231]:
churn['balance']>=83807.86
Out[231]:
0        False
1         True
2         True
3        False
5         True
         ...  
9997     False
10006     True
10025    False
10030    False
10031     True
Name: balance, Length: 9973, dtype: bool
In [232]:
churn[churn['balance']>=83807.86]
Out[232]:
surname creditscore geography gender age tenure balance numofproducts isactivemember estimatedsalary exited
1 Hill 608.00 Spain Female 41.00 1.00 83807.86 1.00 1.00 112542.58 0.00
2 Onio 502.00 France Female 42.00 8.00 159660.80 3.00 0.00 113931.57 1.00
5 Chu 645.00 Spain Male 44.00 8.00 113755.78 2.00 0.00 149756.71 1.00
8 Obinna 376.00 Germany Female 29.00 4.00 115046.74 4.00 0.00 119346.88 1.00
9 He 501.00 France Male 44.00 4.00 142051.07 2.00 1.00 74940.50 0.00
... ... ... ... ... ... ... ... ... ... ... ...
9993 Azikiwe 637.00 France Female 33.00 7.00 103377.81 1.00 0.00 84419.78 0.00
9995 P'eng 774.00 France Male 40.00 9.00 93017.47 2.00 0.00 191608.97 0.00
9996 Diribe 677.00 France Female 58.00 1.00 90022.85 1.00 1.00 2988.28 0.00
10006 Nepean 659.00 France Male 36.00 6.00 123841.49 2.00 0.00 96833.00 0.00
10031 Walker 792.00 France Female 28.00 4.00 130142.79 1.00 0.00 38190.78 0.00

5627 rows × 11 columns

In [233]:
churn['balance'].unique()
Out[233]:
array([     0.  ,  83807.86, 159660.8 , ..., 123841.49,  75075.31,
       130142.79])

QUESTION 16 SHOW ONLY THE BALANCE OF ALL GENDER FEMALE WHO LIVES IN FRANCE ONLY¶

In [234]:
(churn['gender']=='Female')
Out[234]:
0         True
1         True
2         True
3         True
5        False
         ...  
9997     False
10006    False
10025    False
10030    False
10031     True
Name: gender, Length: 9973, dtype: bool
In [235]:
(churn['geography']=='France')
Out[235]:
0         True
1        False
2         True
3         True
5        False
         ...  
9997     False
10006     True
10025     True
10030    False
10031     True
Name: geography, Length: 9973, dtype: bool
In [236]:
(churn['gender']=='Female')&(churn['geography']=='France')
Out[236]:
0         True
1        False
2         True
3         True
5        False
         ...  
9997     False
10006    False
10025    False
10030    False
10031     True
Length: 9973, dtype: bool
In [237]:
churn[(churn['gender']=='Female')&(churn['geography']=='France')]['balance']
Out[237]:
0            0.00
2       159660.80
3            0.00
13           0.00
14           0.00
           ...   
9988    144751.81
9993    103377.81
9994         0.00
9996     90022.85
10031   130142.79
Name: balance, Length: 2257, dtype: float64

QUESTION 17 SHOW ALL RECORDS WERE GENDER IS MALE AND GEOGRAPHY IS SPAIN OR BALANCE IS 159660.8¶

In [238]:
churn.head(2)
Out[238]:
surname creditscore geography gender age tenure balance numofproducts isactivemember estimatedsalary exited
0 Hargrave 619.00 France Female 42.00 2.00 0.00 1.00 1.00 101348.88 1.00
1 Hill 608.00 Spain Female 41.00 1.00 83807.86 1.00 1.00 112542.58 0.00
In [239]:
churn['gender'].unique()
Out[239]:
array(['Female', 'Male'], dtype=object)
In [240]:
(churn['gender']=='Male')
Out[240]:
0        False
1        False
2        False
3        False
5         True
         ...  
9997      True
10006     True
10025     True
10030     True
10031    False
Name: gender, Length: 9973, dtype: bool
In [241]:
(churn['geography']=='Spain')
Out[241]:
0        False
1         True
2        False
3        False
5         True
         ...  
9997      True
10006    False
10025    False
10030    False
10031    False
Name: geography, Length: 9973, dtype: bool
In [242]:
(churn['gender']=='Male')&(churn['geography']=='Spain')
Out[242]:
0        False
1        False
2        False
3        False
5         True
         ...  
9997      True
10006    False
10025    False
10030    False
10031    False
Length: 9973, dtype: bool
In [243]:
churn[(churn['gender']=='Male')&(churn['geography']=='Spain')]
Out[243]:
surname creditscore geography gender age tenure balance numofproducts isactivemember estimatedsalary exited
5 Chu 645.00 Spain Male 44.00 8.00 113755.78 2.00 0.00 149756.71 1.00
12 Andrews 497.00 Spain Male 24.00 3.00 0.00 2.00 0.00 76390.01 0.00
25 Muldrow 587.00 Spain Male 45.00 6.00 0.00 1.00 0.00 158684.81 0.00
43 Watson 490.00 Spain Male 31.00 3.00 145260.23 1.00 1.00 114066.77 0.00
44 Lorenzo 804.00 Spain Male 33.00 7.00 76548.60 1.00 1.00 98453.45 0.00
... ... ... ... ... ... ... ... ... ... ... ...
9976 Graham 764.00 Spain Male 38.00 4.00 113607.47 1.00 0.00 91094.46 0.00
9978 Aldridge 573.00 Spain Male 44.00 9.00 0.00 2.00 0.00 107124.17 0.00
9979 Flynn 702.00 Spain Male 44.00 9.00 0.00 1.00 0.00 59207.41 1.00
9983 Ch'en 479.00 Spain Male 35.00 4.00 125920.98 1.00 1.00 20393.44 0.00
9997 T'ao 741.00 Spain Male 35.00 6.00 74371.49 1.00 0.00 99595.67 0.00

1385 rows × 11 columns

In [244]:
churn[(churn['gender']=='Male')&(churn['geography']=='Spain')|(churn['balance']=='159660.8')]
Out[244]:
surname creditscore geography gender age tenure balance numofproducts isactivemember estimatedsalary exited
5 Chu 645.00 Spain Male 44.00 8.00 113755.78 2.00 0.00 149756.71 1.00
12 Andrews 497.00 Spain Male 24.00 3.00 0.00 2.00 0.00 76390.01 0.00
25 Muldrow 587.00 Spain Male 45.00 6.00 0.00 1.00 0.00 158684.81 0.00
43 Watson 490.00 Spain Male 31.00 3.00 145260.23 1.00 1.00 114066.77 0.00
44 Lorenzo 804.00 Spain Male 33.00 7.00 76548.60 1.00 1.00 98453.45 0.00
... ... ... ... ... ... ... ... ... ... ... ...
9976 Graham 764.00 Spain Male 38.00 4.00 113607.47 1.00 0.00 91094.46 0.00
9978 Aldridge 573.00 Spain Male 44.00 9.00 0.00 2.00 0.00 107124.17 0.00
9979 Flynn 702.00 Spain Male 44.00 9.00 0.00 1.00 0.00 59207.41 1.00
9983 Ch'en 479.00 Spain Male 35.00 4.00 125920.98 1.00 1.00 20393.44 0.00
9997 T'ao 741.00 Spain Male 35.00 6.00 74371.49 1.00 0.00 99595.67 0.00

1385 rows × 11 columns

In [ ]:
 
In [245]:
churn[(churn['gender']=='Male')&(churn['geography']=='Spain')|(churn['surname']=='Lorenzo')]
Out[245]:
surname creditscore geography gender age tenure balance numofproducts isactivemember estimatedsalary exited
5 Chu 645.00 Spain Male 44.00 8.00 113755.78 2.00 0.00 149756.71 1.00
12 Andrews 497.00 Spain Male 24.00 3.00 0.00 2.00 0.00 76390.01 0.00
25 Muldrow 587.00 Spain Male 45.00 6.00 0.00 1.00 0.00 158684.81 0.00
43 Watson 490.00 Spain Male 31.00 3.00 145260.23 1.00 1.00 114066.77 0.00
44 Lorenzo 804.00 Spain Male 33.00 7.00 76548.60 1.00 1.00 98453.45 0.00
... ... ... ... ... ... ... ... ... ... ... ...
9976 Graham 764.00 Spain Male 38.00 4.00 113607.47 1.00 0.00 91094.46 0.00
9978 Aldridge 573.00 Spain Male 44.00 9.00 0.00 2.00 0.00 107124.17 0.00
9979 Flynn 702.00 Spain Male 44.00 9.00 0.00 1.00 0.00 59207.41 1.00
9983 Ch'en 479.00 Spain Male 35.00 4.00 125920.98 1.00 1.00 20393.44 0.00
9997 T'ao 741.00 Spain Male 35.00 6.00 74371.49 1.00 0.00 99595.67 0.00

1395 rows × 11 columns

NUMBEROFPRODUCTS¶

QUESTION 18 WHO HAS THE HIGHEST NUMBER OF PRODUCTS¶

In [246]:
churn.head(2)
Out[246]:
surname creditscore geography gender age tenure balance numofproducts isactivemember estimatedsalary exited
0 Hargrave 619.00 France Female 42.00 2.00 0.00 1.00 1.00 101348.88 1.00
1 Hill 608.00 Spain Female 41.00 1.00 83807.86 1.00 1.00 112542.58 0.00
In [247]:
churn.numofproducts.value_counts()
Out[247]:
1.00    5069
2.00    4578
3.00     266
4.00      60
Name: numofproducts, dtype: int64
In [248]:
churn.numofproducts.unique()
Out[248]:
array([1., 3., 2., 4.])
In [249]:
churn.head(2)
Out[249]:
surname creditscore geography gender age tenure balance numofproducts isactivemember estimatedsalary exited
0 Hargrave 619.00 France Female 42.00 2.00 0.00 1.00 1.00 101348.88 1.00
1 Hill 608.00 Spain Female 41.00 1.00 83807.86 1.00 1.00 112542.58 0.00

QUESTION 19 , WHICH GENDER HAS THE HIGHEST AVERAGE BALANCE¶

In [250]:
churn.columns
Out[250]:
Index(['surname', 'creditscore', 'geography', 'gender', 'age', 'tenure',
       'balance', 'numofproducts', 'isactivemember', 'estimatedsalary',
       'exited'],
      dtype='object')
In [251]:
churn.groupby('gender')['balance'].mean().sort_values(ascending=False)
Out[251]:
gender
Male     77170.09
Female   75693.28
Name: balance, dtype: float64
In [252]:
sns.barplot(x='gender',y='balance',data=churn)
plt.title('gender by balance')
plt.show()

Q 20,DISPLAY TO 10 BALANCE AND GEOGRAPHY¶

In [253]:
churn.columns
Out[253]:
Index(['surname', 'creditscore', 'geography', 'gender', 'age', 'tenure',
       'balance', 'numofproducts', 'isactivemember', 'estimatedsalary',
       'exited'],
      dtype='object')
In [254]:
churn_top10=churn.nlargest(10,('balance'))[['geography']]
In [255]:
churn_top10
Out[255]:
geography
2099 Spain
3287 France
8750 Spain
3595 Spain
6734 Spain
1074 Germany
145 Spain
1540 France
3927 Germany
2716 France
In [256]:
churn_top10=churn.nlargest(10,('balance'))[['geography']].set_index('geography')
In [257]:
churn_top10
Out[257]:
geography
Spain
France
Spain
Spain
Spain
Germany
Spain
France
Germany
France
In [258]:
churn_top10=churn.nsmallest(10,('balance'))[['geography']]
In [259]:
churn_top10
Out[259]:
geography
0 France
3 France
12 Spain
13 France
14 France
18 Spain
24 Spain
25 Spain
26 France
27 France
In [260]:
churn_top10=churn.nlargest(10,('balance'))[['geography','gender']]
In [261]:
churn_top10
Out[261]:
geography gender
2099 Spain Male
3287 France Female
8750 Spain Male
3595 Spain Male
6734 Spain Female
1074 Germany Male
145 Spain Female
1540 France Female
3927 Germany Male
2716 France Male
In [ ]:
 

Q21 , DISPLAY TOTAL NUMBER OF SALARY IN THE DATASET¶

In [262]:
churn['estimatedsalary'].value_counts()
Out[262]:
62825.03     3
24924.92     2
101348.88    1
182692.80    1
38466.39     1
            ..
2465.80      1
47799.15     1
165562.84    1
72927.68     1
38190.78     1
Name: estimatedsalary, Length: 9970, dtype: int64
In [263]:
sns.countplot(x='estimatedsalary',data=churn)
plt.title('no of estimatedsalary')
plt.show()

QUESTION 22, FIND THE MOST POPULAR GEOGRAPHY WITH HIGHEST CREDITSCORE¶

In [264]:
churn.columns
Out[264]:
Index(['surname', 'creditscore', 'geography', 'gender', 'age', 'tenure',
       'balance', 'numofproducts', 'isactivemember', 'estimatedsalary',
       'exited'],
      dtype='object')
In [265]:
churn['creditscore'].max()
Out[265]:
850.0
In [266]:
churn['creditscore']
Out[266]:
0       619.00
1       608.00
2       502.00
3       699.00
5       645.00
         ...  
9997    741.00
10006   659.00
10025   771.00
10030   772.00
10031   792.00
Name: creditscore, Length: 9973, dtype: float64
In [267]:
[churn['creditscore'].max()==churn['creditscore']]
Out[267]:
[0        False
 1        False
 2        False
 3        False
 5        False
          ...  
 9997     False
 10006    False
 10025    False
 10030    False
 10031    False
 Name: creditscore, Length: 9973, dtype: bool]
In [268]:
churn[churn['creditscore'].max()==churn['creditscore']]
Out[268]:
surname creditscore geography gender age tenure balance numofproducts isactivemember estimatedsalary exited
45 Armstrong 850.00 France Male 36.00 7.00 0.00 1.00 1.00 40812.90 0.00
187 Rozier 850.00 Spain Female 45.00 2.00 122311.21 1.00 1.00 19482.50 0.00
207 Chiemezie 850.00 Spain Male 30.00 2.00 141040.01 1.00 1.00 5978.20 0.00
230 Stevenson 850.00 France Male 33.00 10.00 0.00 1.00 0.00 4861.72 1.00
266 Welch 850.00 Germany Male 38.00 3.00 54901.01 1.00 1.00 140075.55 0.00
... ... ... ... ... ... ... ... ... ... ... ...
9598 Glover 850.00 France Female 28.00 9.00 0.00 2.00 1.00 164864.67 0.00
9663 Muir 850.00 Spain Male 71.00 10.00 69608.14 1.00 0.00 97893.40 1.00
9705 Ward 850.00 France Male 68.00 5.00 169445.40 1.00 1.00 186335.07 0.00
9948 Greco 850.00 France Female 34.00 6.00 101266.51 1.00 0.00 33501.98 0.00
9966 Wan 850.00 France Female 47.00 9.00 137301.87 1.00 0.00 44351.77 0.00

231 rows × 11 columns

In [269]:
churn[churn['creditscore'].max()==churn['creditscore']]['geography']
Out[269]:
45       France
187       Spain
207       Spain
230      France
266     Germany
         ...   
9598     France
9663      Spain
9705     France
9948     France
9966     France
Name: geography, Length: 231, dtype: object

Q 23, DOES SALARY AFFECT BALANCE¶

In [270]:
churn.columns
Out[270]:
Index(['surname', 'creditscore', 'geography', 'gender', 'age', 'tenure',
       'balance', 'numofproducts', 'isactivemember', 'estimatedsalary',
       'exited'],
      dtype='object')
In [271]:
sns.scatterplot(x='estimatedsalary',y='balance',data=churn)
Out[271]:
<AxesSubplot:xlabel='estimatedsalary', ylabel='balance'>

Q24, CLASSIFY CREDITSCORE BASED ON EXITED¶

In [272]:
churn
Out[272]:
surname creditscore geography gender age tenure balance numofproducts isactivemember estimatedsalary exited
0 Hargrave 619.00 France Female 42.00 2.00 0.00 1.00 1.00 101348.88 1.00
1 Hill 608.00 Spain Female 41.00 1.00 83807.86 1.00 1.00 112542.58 0.00
2 Onio 502.00 France Female 42.00 8.00 159660.80 3.00 0.00 113931.57 1.00
3 Boni 699.00 France Female 39.00 1.00 0.00 2.00 0.00 93826.63 0.00
5 Chu 645.00 Spain Male 44.00 8.00 113755.78 2.00 0.00 149756.71 1.00
... ... ... ... ... ... ... ... ... ... ... ...
9997 T'ao 741.00 Spain Male 35.00 6.00 74371.49 1.00 0.00 99595.67 0.00
10006 Nepean 659.00 France Male 36.00 6.00 123841.49 2.00 0.00 96833.00 0.00
10025 Obijiaku 771.00 France Male 39.00 5.00 0.00 2.00 0.00 96270.64 0.00
10030 Sabbatini 772.00 Germany Male 42.00 3.00 75075.31 2.00 0.00 92888.52 1.00
10031 Walker 792.00 France Female 28.00 4.00 130142.79 1.00 0.00 38190.78 0.00

9973 rows × 11 columns

In [ ]:
 
In [273]:
churn.head(2)
Out[273]:
surname creditscore geography gender age tenure balance numofproducts isactivemember estimatedsalary exited
0 Hargrave 619.00 France Female 42.00 2.00 0.00 1.00 1.00 101348.88 1.00
1 Hill 608.00 Spain Female 41.00 1.00 83807.86 1.00 1.00 112542.58 0.00

VISUALYZING AND EXPLORATORY DATA ANALYSIS (EDA)¶

GEOGRAPHY¶

In [274]:
churn.geography.unique()
Out[274]:
array(['France', 'Spain', 'Germany'], dtype=object)
In [275]:
sns.set(style='whitegrid')
plt.figure(figsize=(14, 7))
labels=['France', 'Spain','Germany'],
plt.pie(churn['geography'].value_counts(),labels=['France', 'Spain','Germany'],explode=[0.1,0.1,0.1],
        autopct='%1.2f%%',colors=['#E37383','#FFC0CB'], startangle=90)
plt.title('geography')
plt.axis('equal')
plt.show()

GENDER¶

In [276]:
plt.figure(figsize=(8,5))
sns.countplot('gender', data = churn, color='#00ddff', saturation=0.9)
Out[276]:
<AxesSubplot:xlabel='gender', ylabel='count'>
In [277]:
plt.figure(figsize=(20,20))
sns.heatmap(churn.corr(), vmin=-1, cmap="plasma_r", annot=True)
#same thing can be seen from the correlation as well
Out[277]:
<AxesSubplot:>
In [278]:
churn.head(2)
Out[278]:
surname creditscore geography gender age tenure balance numofproducts isactivemember estimatedsalary exited
0 Hargrave 619.00 France Female 42.00 2.00 0.00 1.00 1.00 101348.88 1.00
1 Hill 608.00 Spain Female 41.00 1.00 83807.86 1.00 1.00 112542.58 0.00
In [279]:
sns.scatterplot(x='age',y='exited',data=churn)
Out[279]:
<AxesSubplot:xlabel='age', ylabel='exited'>
In [280]:
sns.scatterplot(x='age',y='exited',data=churn,hue='gender',palette='RdYlBu')
Out[280]:
<AxesSubplot:xlabel='age', ylabel='exited'>
In [281]:
sns.barplot(x='gender',y='exited',data=churn,palette='RdYlBu')
Out[281]:
<AxesSubplot:xlabel='gender', ylabel='exited'>
In [282]:
sns.histplot(churn['tenure'],kde=True,bins=15)
Out[282]:
<AxesSubplot:xlabel='tenure', ylabel='Count'>
In [ ]:
 
In [283]:
sns.boxplot(x='geography',y='exited',data=churn,hue='gender',palette='YlGnBu')
Out[283]:
<AxesSubplot:xlabel='geography', ylabel='exited'>
In [284]:
sns.histplot(churn['estimatedsalary'],kde=True,bins=15)
Out[284]:
<AxesSubplot:xlabel='estimatedsalary', ylabel='Count'>
In [285]:
sns.barplot(x='gender',y='estimatedsalary',data=churn,palette='RdYlBu')
Out[285]:
<AxesSubplot:xlabel='gender', ylabel='estimatedsalary'>
In [286]:
sns.boxplot(x='geography',y='estimatedsalary',data=churn,hue='gender',palette='YlGnBu')
Out[286]:
<AxesSubplot:xlabel='geography', ylabel='estimatedsalary'>
In [287]:
sns.histplot(churn['balance'],kde=True,bins=15)
Out[287]:
<AxesSubplot:xlabel='balance', ylabel='Count'>
In [288]:
sns.barplot(x='gender',y='balance',data=churn,palette='RdYlBu')
Out[288]:
<AxesSubplot:xlabel='gender', ylabel='balance'>
In [289]:
sns.boxplot(x='geography',y='balance',data=churn,hue='gender',palette='YlGnBu')
Out[289]:
<AxesSubplot:xlabel='geography', ylabel='balance'>
In [290]:
sns.histplot(churn['creditscore'],kde=True,bins=15)
Out[290]:
<AxesSubplot:xlabel='creditscore', ylabel='Count'>
In [291]:
sns.barplot(x='gender',y='creditscore',data=churn,palette='RdYlBu')
Out[291]:
<AxesSubplot:xlabel='gender', ylabel='creditscore'>
In [292]:
sns.boxplot(x='geography',y='creditscore',data=churn,hue='gender',palette='YlGnBu')
Out[292]:
<AxesSubplot:xlabel='geography', ylabel='creditscore'>
In [293]:
sns.stripplot(x='geography',y='creditscore',data=churn,hue='gender',dodge=True ,palette='YlGnBu')
Out[293]:
<AxesSubplot:xlabel='geography', ylabel='creditscore'>
In [294]:
sns.stripplot(x='geography',y='exited',data=churn,hue='gender',dodge=True ,palette='YlGnBu')
Out[294]:
<AxesSubplot:xlabel='geography', ylabel='exited'>
In [ ]:
 
In [295]:
churn.head(2)
Out[295]:
surname creditscore geography gender age tenure balance numofproducts isactivemember estimatedsalary exited
0 Hargrave 619.00 France Female 42.00 2.00 0.00 1.00 1.00 101348.88 1.00
1 Hill 608.00 Spain Female 41.00 1.00 83807.86 1.00 1.00 112542.58 0.00
In [296]:
sns.jointplot(x='exited',y='balance',data=churn,kind='reg')
Out[296]:
<seaborn.axisgrid.JointGrid at 0x1b50039c3a0>
In [297]:
sns.jointplot(x='exited',y='estimatedsalary',data=churn)
Out[297]:
<seaborn.axisgrid.JointGrid at 0x1b50016b490>
In [298]:
sns.jointplot(x='creditscore',y='estimatedsalary',data=churn,)
Out[298]:
<seaborn.axisgrid.JointGrid at 0x1b57ff08a00>
In [299]:
sns.jointplot(x='creditscore',y='estimatedsalary',data=churn,kind='kde')
Out[299]:
<seaborn.axisgrid.JointGrid at 0x1b57ed06a00>
In [300]:
sns.jointplot(x='creditscore',y='estimatedsalary',data=churn,kind='kde',shade=True)
Out[300]:
<seaborn.axisgrid.JointGrid at 0x1b57eb51310>
In [301]:
sns.jointplot(x='creditscore',y='estimatedsalary',data=churn,kind='kde',shade=True,cmap='YlGnBu')
Out[301]:
<seaborn.axisgrid.JointGrid at 0x1b57e98f940>
In [302]:
sns.jointplot(x='creditscore',y='estimatedsalary',data=churn,kind='hex',cmap='YlGnBu')
Out[302]:
<seaborn.axisgrid.JointGrid at 0x1b57e7cdaf0>
In [303]:
sns.countplot(x='exited',hue='gender',data=churn)
Out[303]:
<AxesSubplot:xlabel='exited', ylabel='count'>
In [304]:
sns.countplot(x='exited',hue='geography',data=churn)
Out[304]:
<AxesSubplot:xlabel='exited', ylabel='count'>
In [305]:
churn['age'].plot.hist()
Out[305]:
<AxesSubplot:ylabel='Frequency'>
In [306]:
churn['balance'].plot.hist(bins=20,figsize=(10,5))
Out[306]:
<AxesSubplot:ylabel='Frequency'>
In [307]:
churn['creditscore'].plot.hist(bins=20,figsize=(10,5))
Out[307]:
<AxesSubplot:ylabel='Frequency'>
In [308]:
churn['estimatedsalary'].plot.hist(bins=20,figsize=(10,5))
Out[308]:
<AxesSubplot:ylabel='Frequency'>
In [309]:
churn.head(2)
Out[309]:
surname creditscore geography gender age tenure balance numofproducts isactivemember estimatedsalary exited
0 Hargrave 619.00 France Female 42.00 2.00 0.00 1.00 1.00 101348.88 1.00
1 Hill 608.00 Spain Female 41.00 1.00 83807.86 1.00 1.00 112542.58 0.00
In [310]:
churn['exited'].unique
Out[310]:
<bound method Series.unique of 0       1.00
1       0.00
2       1.00
3       0.00
5       1.00
        ... 
9997    0.00
10006   0.00
10025   0.00
10030   1.00
10031   0.00
Name: exited, Length: 9973, dtype: float64>
In [311]:
#visualization of this data
cols=['creditscore','geography','gender','age','tenure','balance','numofproducts','isactivemember','estimatedsalary','exited']
n_rows=2
n_cols=3
#the subplot grid and figsize of the graph
fig,  axs =plt.subplots(n_rows,n_cols,figsize=(n_cols*3.2,n_rows*3.2))
for r in range(0,n_rows):
    for c in range (0,n_cols):
        i = r*n_cols+c  #index to go through the number of columns
        ax= axs[r][c]   #show were to position each subplot
        sns.countplot(churn[cols[i]],hue=churn['exited'],ax=ax)
        ax.set_title(cols[i])
        ax.legend(title='exited',loc='upper right')
        plt.tight_layout
In [312]:
churn.pivot_table('exited',index='gender',columns='geography').plot()
Out[312]:
<AxesSubplot:xlabel='gender'>
In [313]:
f, ax =plt.subplots(1,2,figsize= (18,8))
churn['exited'].value_counts().plot.pie(explode=[0,0.1,],autopct='%1.1f%%',ax=ax[0],shadow=True)
ax[0].set_title('exited')
ax[0].set_ylabel('')
sns.countplot('exited',data=churn,ax=ax[1])
ax[1].set_title('exited')
plt.show()
In [314]:
churn.head(2)
Out[314]:
surname creditscore geography gender age tenure balance numofproducts isactivemember estimatedsalary exited
0 Hargrave 619.00 France Female 42.00 2.00 0.00 1.00 1.00 101348.88 1.00
1 Hill 608.00 Spain Female 41.00 1.00 83807.86 1.00 1.00 112542.58 0.00
In [315]:
churn_gender = churn.groupby(['gender'])[['creditscore', 'balance', 'estimatedsalary']].mean()
In [316]:
colors_4 = ['magenta','yellow','green','red']
colors_3 =['magenta','yellow','green']
In [317]:
churn_gender.plot.pie(subplots=True, figsize=(20,10), labels=churn_gender.index, autopct='%1.1f%%',colors=colors_4)
plt.show()
In [318]:
churn_geography = churn.groupby(['geography'])[['creditscore', 'balance', 'estimatedsalary']].mean()
In [319]:
churn_geography.plot.pie(subplots=True, figsize=(20,10), labels=churn_geography.index, autopct='%1.1f%%',colors=colors_4)
plt.show()
In [320]:
sns.factorplot('geography','exited',hue='gender',data=churn)
Out[320]:
<seaborn.axisgrid.FacetGrid at 0x1b51c537a30>
In [321]:
f, ax = plt.subplots(1,2,figsize=(18,8,))
sns.violinplot('geography','age', hue='exited',data=churn,ax=ax[0])
ax[0].set_title('geography and age vs exited')
ax[0].set_yticks(range(0,110,10))
sns.violinplot('gender','age',hue='exited',data=churn,ax=ax[1])
ax[1].set_title('gender and age vs exited')
ax[1].set_yticks(range(0,110,10))
plt.show()
In [322]:
f, ax=plt.subplots(1,2,figsize=(12,10))
churn[churn['exited']==0].age.plot.hist(ax=ax[0],bins=20,edgecolor='black',color='red')
ax[0].set_title('exited=0')
x1=list(range(0,85,5))
ax[0].set_xticks(x1)
churn[churn['exited']==1].age.plot.hist(ax=ax[1],color='green',bins=20,edgecolor='black')
ax[1].set_title('exited=1')
x2=list(range(0,85,5))
ax[1].set_xticks(x2)
plt.show()
In [323]:
plt.figure(figsize=(30,30))
sns.set_context("talk",font_scale=1)
sns.set_palette("pastel")
ax = sns.countplot(y="age", hue="gender", data=churn)
ax.legend(loc='upper right',frameon=True)
plt.title('GENDER vs AGE', fontsize=18, fontweight='bold')
ax.set(xlabel='COUNT OF GENDER',ylabel='AGE')
plt.show()
In [324]:
plt.figure(figsize=(6,6))
sns.set_context("talk",font_scale=1)
sns.set_palette("pastel")
ax = sns.countplot(y="gender", hue="exited", data=churn)
ax.legend(loc='upper right',frameon=True)
plt.title('exited vs gender', fontsize=8, fontweight='bold')
ax.set(xlabel='COUNT OF exited ',ylabel='gender')
plt.show()
In [325]:
plt.figure(figsize=(8,8))
plt.title('Correlation Analysis',color='Red',fontsize=20,pad=40)

corr = churn.corr()
mask = np.triu(np.ones_like(corr,dtype = bool))
sns.heatmap(churn.corr(),mask=mask,annot=True,linewidths=.5);
plt.xticks(rotation=60)
plt.yticks(rotation = 60)
plt.show()
In [326]:
colors = ['#E94B3C','#2D2926']

exited = churn[churn['exited'] == 1].describe().T
not_exited = churn[churn['exited'] == 0].describe().T

fig,ax = plt.subplots(nrows = 1,ncols = 2,figsize = (5,5))
plt.subplot(1,2,1)
sns.heatmap(exited[['mean']],annot = True,cmap = colors,linewidths = 0.4,linecolor = 'black',cbar = False,fmt = '.2f')
plt.title('exited Customers');

plt.subplot(1,2,2)
sns.heatmap(not_exited[['mean']],annot = True,cmap = colors,linewidths = 0.4,linecolor = 'black',cbar = False,fmt = '.2f',)
plt.title('Not_exited Customers');

fig.tight_layout(pad = 0)
In [327]:
l = list(churn['exited'].value_counts())
circle = [l[0] / sum(l) * 100,l[1] / sum(l) * 100]

fig = plt.subplots(nrows = 1,ncols = 2,figsize = (20,5))
plt.subplot(1,2,1)
plt.pie(circle,labels = ['Not-exited Customer','exited Customer'],autopct = '%1.1f%%',startangle = 90,explode = (0.1,0),colors = colors,
       wedgeprops = {'edgecolor' : 'black','linewidth': 1,'antialiased' : True})
plt.title('exited - Not-exited %');

plt.subplot(1,2,2)
ax = sns.countplot('exited',data = churn,palette = colors,edgecolor = 'black')
for rect in ax.patches:
    ax.text(rect.get_x() + rect.get_width() / 2, rect.get_height() + 2, rect.get_height(), horizontalalignment='center', fontsize = 11)

    
plt.title('Number of Exited - Not-Exited Customers');
plt.show()

DATA SCIENCE¶

In [328]:
churn.head(5)
Out[328]:
surname creditscore geography gender age tenure balance numofproducts isactivemember estimatedsalary exited
0 Hargrave 619.00 France Female 42.00 2.00 0.00 1.00 1.00 101348.88 1.00
1 Hill 608.00 Spain Female 41.00 1.00 83807.86 1.00 1.00 112542.58 0.00
2 Onio 502.00 France Female 42.00 8.00 159660.80 3.00 0.00 113931.57 1.00
3 Boni 699.00 France Female 39.00 1.00 0.00 2.00 0.00 93826.63 0.00
5 Chu 645.00 Spain Male 44.00 8.00 113755.78 2.00 0.00 149756.71 1.00

DATA PREPROCESSING¶

converting categorical variable to numerical cvariable we can¶

use 2 methods ENCODING AND MAPP METHODS¶

MAP METHOD OF CONVERTING CATEGORICAL TO NUMERICAL¶

In [329]:
churn.head()
Out[329]:
surname creditscore geography gender age tenure balance numofproducts isactivemember estimatedsalary exited
0 Hargrave 619.00 France Female 42.00 2.00 0.00 1.00 1.00 101348.88 1.00
1 Hill 608.00 Spain Female 41.00 1.00 83807.86 1.00 1.00 112542.58 0.00
2 Onio 502.00 France Female 42.00 8.00 159660.80 3.00 0.00 113931.57 1.00
3 Boni 699.00 France Female 39.00 1.00 0.00 2.00 0.00 93826.63 0.00
5 Chu 645.00 Spain Male 44.00 8.00 113755.78 2.00 0.00 149756.71 1.00
In [330]:
churn=churn.drop(churn.columns[[0]],axis=1)
In [331]:
churn.dtypes
Out[331]:
creditscore        float64
geography           object
gender              object
age                float64
tenure             float64
balance            float64
numofproducts      float64
isactivemember     float64
estimatedsalary    float64
exited             float64
dtype: object
In [332]:
churn_catg
Out[332]:
surname geography gender
0 Hargrave France Female
1 Hill Spain Female
2 Onio France Female
3 Boni France Female
4 NaN NaN NaN
... ... ... ...
10027 Johnstone France Male
10028 Liu France Female
10029 Liu France Female
10030 Sabbatini Germany Male
10031 Walker France Female

10032 rows × 3 columns

In [333]:
churn_cont.head(2)
Out[333]:
rownumber customerid creditscore age tenure balance numofproducts hascrcard isactivemember estimatedsalary exited
0 1.00 15634602.00 619.00 42.00 2.00 0.00 1.00 1.00 1.00 101348.88 1.00
1 2.00 NaN 608.00 41.00 1.00 83807.86 1.00 0.00 1.00 112542.58 0.00
In [334]:
churn1
Out[334]:
surname creditscore geography gender age tenure balance numofproducts isactivemember estimatedsalary exited
0 Hargrave 619.00 France Female 42.00 2.00 0.00 1.00 1.00 101348.88 1.00
1 Hill 608.00 Spain Female 41.00 1.00 83807.86 1.00 1.00 112542.58 0.00
2 Onio 502.00 France Female 42.00 8.00 159660.80 3.00 0.00 113931.57 1.00
3 Boni 699.00 France Female 39.00 1.00 0.00 2.00 0.00 93826.63 0.00
5 Chu 645.00 Spain Male 44.00 8.00 113755.78 2.00 0.00 149756.71 1.00
... ... ... ... ... ... ... ... ... ... ... ...
9997 T'ao 741.00 Spain Male 35.00 6.00 74371.49 1.00 0.00 99595.67 0.00
10006 Nepean 659.00 France Male 36.00 6.00 123841.49 2.00 0.00 96833.00 0.00
10025 Obijiaku 771.00 France Male 39.00 5.00 0.00 2.00 0.00 96270.64 0.00
10030 Sabbatini 772.00 Germany Male 42.00 3.00 75075.31 2.00 0.00 92888.52 1.00
10031 Walker 792.00 France Female 28.00 4.00 130142.79 1.00 0.00 38190.78 0.00

9913 rows × 11 columns

In [335]:
churn.head()
Out[335]:
creditscore geography gender age tenure balance numofproducts isactivemember estimatedsalary exited
0 619.00 France Female 42.00 2.00 0.00 1.00 1.00 101348.88 1.00
1 608.00 Spain Female 41.00 1.00 83807.86 1.00 1.00 112542.58 0.00
2 502.00 France Female 42.00 8.00 159660.80 3.00 0.00 113931.57 1.00
3 699.00 France Female 39.00 1.00 0.00 2.00 0.00 93826.63 0.00
5 645.00 Spain Male 44.00 8.00 113755.78 2.00 0.00 149756.71 1.00

REPLACE METHOD¶

In [336]:
#geography1={'geography':{'France':0,'Spain':1,'Germany':2}}
In [337]:
#churn=churn.replace(geography1)
In [338]:
churn.head()
Out[338]:
creditscore geography gender age tenure balance numofproducts isactivemember estimatedsalary exited
0 619.00 France Female 42.00 2.00 0.00 1.00 1.00 101348.88 1.00
1 608.00 Spain Female 41.00 1.00 83807.86 1.00 1.00 112542.58 0.00
2 502.00 France Female 42.00 8.00 159660.80 3.00 0.00 113931.57 1.00
3 699.00 France Female 39.00 1.00 0.00 2.00 0.00 93826.63 0.00
5 645.00 Spain Male 44.00 8.00 113755.78 2.00 0.00 149756.71 1.00

LABEL ENCODER METHOD¶

In [339]:
from sklearn.preprocessing import LabelEncoder
In [340]:
le=LabelEncoder()
In [341]:
churn['geography1']=le.fit_transform(churn.geography)
In [342]:
churn.head()
Out[342]:
creditscore geography gender age tenure balance numofproducts isactivemember estimatedsalary exited geography1
0 619.00 France Female 42.00 2.00 0.00 1.00 1.00 101348.88 1.00 0
1 608.00 Spain Female 41.00 1.00 83807.86 1.00 1.00 112542.58 0.00 2
2 502.00 France Female 42.00 8.00 159660.80 3.00 0.00 113931.57 1.00 0
3 699.00 France Female 39.00 1.00 0.00 2.00 0.00 93826.63 0.00 0
5 645.00 Spain Male 44.00 8.00 113755.78 2.00 0.00 149756.71 1.00 2

dummies variable methiod¶

In [343]:
churn=pd.get_dummies(churn,drop_first=True)
In [344]:
churn.head()
Out[344]:
creditscore age tenure balance numofproducts isactivemember estimatedsalary exited geography1 geography_Germany geography_Spain gender_Male
0 619.00 42.00 2.00 0.00 1.00 1.00 101348.88 1.00 0 0 0 0
1 608.00 41.00 1.00 83807.86 1.00 1.00 112542.58 0.00 2 0 1 0
2 502.00 42.00 8.00 159660.80 3.00 0.00 113931.57 1.00 0 0 0 0
3 699.00 39.00 1.00 0.00 2.00 0.00 93826.63 0.00 0 0 0 0
5 645.00 44.00 8.00 113755.78 2.00 0.00 149756.71 1.00 2 0 1 1
In [345]:
#churn=pd.get_dummies(churn,columns=['gender'],prefix='gender',drop_first=True)

INBALANCE DATASET¶

UNDERSAMPLING AND OVERSAMPLING METHODS¶

In [346]:
churn.head(2)
Out[346]:
creditscore age tenure balance numofproducts isactivemember estimatedsalary exited geography1 geography_Germany geography_Spain gender_Male
0 619.00 42.00 2.00 0.00 1.00 1.00 101348.88 1.00 0 0 0 0
1 608.00 41.00 1.00 83807.86 1.00 1.00 112542.58 0.00 2 0 1 0
In [347]:
churn['exited'].value_counts()
Out[347]:
0.00    7943
1.00    2030
Name: exited, dtype: int64
In [348]:
sns.countplot(churn['exited'])
Out[348]:
<AxesSubplot:xlabel='exited', ylabel='count'>

store feature matrix in X and response target in vector y¶

In [349]:
X= churn.drop('exited',axis=1)
In [350]:
y=churn['exited']
In [351]:
X
Out[351]:
creditscore age tenure balance numofproducts isactivemember estimatedsalary geography1 geography_Germany geography_Spain gender_Male
0 619.00 42.00 2.00 0.00 1.00 1.00 101348.88 0 0 0 0
1 608.00 41.00 1.00 83807.86 1.00 1.00 112542.58 2 0 1 0
2 502.00 42.00 8.00 159660.80 3.00 0.00 113931.57 0 0 0 0
3 699.00 39.00 1.00 0.00 2.00 0.00 93826.63 0 0 0 0
5 645.00 44.00 8.00 113755.78 2.00 0.00 149756.71 2 0 1 1
... ... ... ... ... ... ... ... ... ... ... ...
9997 741.00 35.00 6.00 74371.49 1.00 0.00 99595.67 2 0 1 1
10006 659.00 36.00 6.00 123841.49 2.00 0.00 96833.00 0 0 0 1
10025 771.00 39.00 5.00 0.00 2.00 0.00 96270.64 0 0 0 1
10030 772.00 42.00 3.00 75075.31 2.00 0.00 92888.52 1 1 0 1
10031 792.00 28.00 4.00 130142.79 1.00 0.00 38190.78 0 0 0 0

9973 rows × 11 columns

In [352]:
y
Out[352]:
0       1.00
1       0.00
2       1.00
3       0.00
5       1.00
        ... 
9997    0.00
10006   0.00
10025   0.00
10030   1.00
10031   0.00
Name: exited, Length: 9973, dtype: float64

split dataset into training set and test set¶

In [353]:
from sklearn.model_selection import train_test_split
In [354]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.20,random_state=42,stratify=y)
In [355]:
X_train
Out[355]:
creditscore age tenure balance numofproducts isactivemember estimatedsalary geography1 geography_Germany geography_Spain gender_Male
5800 624.00 71.00 7.00 0.00 2.00 1.00 108841.83 0 0 0 1
5401 727.00 28.00 1.00 0.00 1.00 0.00 40357.39 2 0 1 1
2369 709.00 45.00 4.00 122917.71 1.00 1.00 11.58 1 1 0 1
7532 623.00 35.00 0.00 130557.24 1.00 1.00 47880.71 0 0 0 0
3495 587.00 35.00 3.00 83286.56 1.00 0.00 125553.52 2 0 1 0
... ... ... ... ... ... ... ... ... ... ... ...
9692 735.00 39.00 1.00 60374.98 1.00 0.00 40223.74 2 0 1 1
4820 517.00 39.00 3.00 0.00 2.00 1.00 12465.51 2 0 1 1
3082 556.00 35.00 10.00 0.00 2.00 1.00 192751.18 0 0 0 1
2558 537.00 53.00 3.00 0.00 1.00 1.00 91406.62 0 0 0 0
9301 633.00 61.00 3.00 157201.48 1.00 1.00 50368.63 0 0 0 1

7978 rows × 11 columns

In [356]:
y_train
Out[356]:
5800   0.00
5401   0.00
2369   1.00
7532   0.00
3495   0.00
       ... 
9692   0.00
4820   0.00
3082   0.00
2558   0.00
9301   0.00
Name: exited, Length: 7978, dtype: float64

MACHINE LEARNING ALGRITHMS¶

LOGISTIC REGRESSION¶

In [357]:
churn.head()
Out[357]:
creditscore age tenure balance numofproducts isactivemember estimatedsalary exited geography1 geography_Germany geography_Spain gender_Male
0 619.00 42.00 2.00 0.00 1.00 1.00 101348.88 1.00 0 0 0 0
1 608.00 41.00 1.00 83807.86 1.00 1.00 112542.58 0.00 2 0 1 0
2 502.00 42.00 8.00 159660.80 3.00 0.00 113931.57 1.00 0 0 0 0
3 699.00 39.00 1.00 0.00 2.00 0.00 93826.63 0.00 0 0 0 0
5 645.00 44.00 8.00 113755.78 2.00 0.00 149756.71 1.00 2 0 1 1
In [358]:
X
Out[358]:
creditscore age tenure balance numofproducts isactivemember estimatedsalary geography1 geography_Germany geography_Spain gender_Male
0 619.00 42.00 2.00 0.00 1.00 1.00 101348.88 0 0 0 0
1 608.00 41.00 1.00 83807.86 1.00 1.00 112542.58 2 0 1 0
2 502.00 42.00 8.00 159660.80 3.00 0.00 113931.57 0 0 0 0
3 699.00 39.00 1.00 0.00 2.00 0.00 93826.63 0 0 0 0
5 645.00 44.00 8.00 113755.78 2.00 0.00 149756.71 2 0 1 1
... ... ... ... ... ... ... ... ... ... ... ...
9997 741.00 35.00 6.00 74371.49 1.00 0.00 99595.67 2 0 1 1
10006 659.00 36.00 6.00 123841.49 2.00 0.00 96833.00 0 0 0 1
10025 771.00 39.00 5.00 0.00 2.00 0.00 96270.64 0 0 0 1
10030 772.00 42.00 3.00 75075.31 2.00 0.00 92888.52 1 1 0 1
10031 792.00 28.00 4.00 130142.79 1.00 0.00 38190.78 0 0 0 0

9973 rows × 11 columns

FEAUTURE SCALING¶

In [359]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
In [360]:
sc=StandardScaler()
In [361]:
X_train=sc.fit_transform(X_train)
In [362]:
X_test=sc.fit_transform(X_test)
In [363]:
X_train
Out[363]:
array([[-0.26906411,  3.04660636,  0.69093269, ..., -0.57918359,
        -0.57628887,  0.9139413 ],
       [ 0.79139897, -1.03510395, -1.38121605, ..., -0.57918359,
         1.73524086,  0.9139413 ],
       [ 0.60607532,  0.57859547, -0.34514168, ...,  1.72656825,
        -0.57628887,  0.9139413 ],
       ...,
       [-0.96917566, -0.37063949,  1.72700706, ..., -0.57918359,
        -0.57628887,  0.9139413 ],
       [-1.16479506,  1.33798344, -0.6904998 , ..., -0.57918359,
        -0.57628887, -1.09416217],
       [-0.17640229,  2.0973714 , -0.6904998 , ..., -0.57918359,
        -0.57628887,  0.9139413 ]])
In [364]:
from sklearn.linear_model import LogisticRegression
In [365]:
log=LogisticRegression()
In [366]:
log.fit(X_train,y_train)
Out[366]:
LogisticRegression()
In [367]:
y_pred1=log.predict(X_test)
In [368]:
from sklearn.metrics import accuracy_score
In [369]:
accuracy_score(y_test,y_pred1)
Out[369]:
0.8095238095238095
In [370]:
from sklearn.metrics import precision_score,recall_score,f1_score
In [371]:
precision_score(y_test,y_pred1)
Out[371]:
0.5955882352941176
In [372]:
recall_score(y_test,y_pred1)
Out[372]:
0.19950738916256158
In [373]:
f1_score(y_test,y_pred1)
Out[373]:
0.2988929889298893

EVALUATE IMBALANCE DATA SET FIRST BY UNDERSAMPLING¶

In [374]:
normal=churn[churn['exited']==0]
fraud=churn[churn['exited']==1]
In [375]:
normal.shape
Out[375]:
(7943, 12)
In [376]:
fraud.shape
Out[376]:
(2030, 12)
In [377]:
normal_sample=normal.sample(n=2055)
In [378]:
normal_sample.shape
Out[378]:
(2055, 12)
In [379]:
new_churn=pd.concat([normal_sample,fraud])
In [380]:
new_churn['exited'].value_counts()
Out[380]:
0.00    2055
1.00    2030
Name: exited, dtype: int64
In [381]:
new_churn.head()
Out[381]:
creditscore age tenure balance numofproducts isactivemember estimatedsalary exited geography1 geography_Germany geography_Spain gender_Male
3921 658.00 44.00 2.00 168396.34 1.00 1.00 14178.73 0.00 0 0 0 1
8311 704.00 36.00 2.00 175509.80 2.00 0.00 152039.67 0.00 2 0 1 0
5438 589.00 39.00 7.00 0.00 2.00 0.00 95985.64 0.00 0 0 0 0
3044 693.00 21.00 1.00 0.00 2.00 1.00 3494.02 0.00 0 0 0 1
8989 650.00 32.00 4.00 79450.09 1.00 1.00 118324.75 0.00 2 0 1 0
In [382]:
X=new_churn.drop('exited',axis=1)
In [383]:
y=new_churn['exited']
In [384]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.20,random_state=42)
In [385]:
log=LogisticRegression()
log.fit(X_train,y_train)
Out[385]:
LogisticRegression()
In [386]:
y_pred1=log.predict(X_test)
In [387]:
accuracy_score(y_test,y_pred1)
Out[387]:
0.594859241126071
In [388]:
from sklearn.metrics import precision_score,recall_score,f1_score
In [389]:
precision_score(y_test,y_pred1)
Out[389]:
0.5618448637316562
In [390]:
recall_score(y_test,y_pred1)
Out[390]:
0.6871794871794872
In [391]:
f1_score(y_test,y_pred1)
Out[391]:
0.6182237600922722

DECISION TREE CLASSIFIER¶

In [392]:
from sklearn.tree import DecisionTreeClassifier
In [393]:
dt=DecisionTreeClassifier()
In [394]:
dt.fit(X_train,y_train)
Out[394]:
DecisionTreeClassifier()
In [395]:
y_pred2=dt.predict(X_test)
In [396]:
accuracy_score(y_test,y_pred2)
Out[396]:
0.7025703794369645
In [397]:
f1_score(y_test,y_pred2)
Out[397]:
0.6864516129032259
In [ ]:
 
In [ ]:
 
In [398]:
precision_score(y_test,y_pred2)
Out[398]:
0.6909090909090909
In [399]:
recall_score(y_test,y_pred2)
Out[399]:
0.6820512820512821

RANDOM FOREST CLASSIFIER¶

In [400]:
from sklearn.ensemble import RandomForestClassifier
In [401]:
rf=RandomForestClassifier()
In [402]:
rf.fit(X_train,y_train)
Out[402]:
RandomForestClassifier()
In [403]:
y_pred3=rf.predict(X_test)
In [404]:
accuracy_score(y_test,y_pred3)
Out[404]:
0.7870257037943696
In [405]:
f1_score(y_test,y_pred3)
Out[405]:
0.7740259740259741
In [406]:
precision_score(y_test,y_pred3)
Out[406]:
0.7842105263157895
In [407]:
recall_score(y_test,y_pred3)
Out[407]:
0.764102564102564

visualizing the results after after undersampling¶

In [408]:
final_data=pd.DataFrame({'Models':['log','dt','rf'],
             'ACC':[accuracy_score(y_test,y_pred1)*100,
                                               
                                                    
                    accuracy_score(y_test,y_pred2)*100,
                    accuracy_score(y_test,y_pred3)*100]}) 
In [409]:
final_data
Out[409]:
Models ACC
0 log 59.49
1 dt 70.26
2 rf 78.70
In [410]:
sns.barplot(final_data['Models'],final_data['ACC'])
Out[410]:
<AxesSubplot:xlabel='Models', ylabel='ACC'>

OVERSAMPLING USING SMOTE¶

synthetic minority over sampling¶

In [411]:
churn.head()
Out[411]:
creditscore age tenure balance numofproducts isactivemember estimatedsalary exited geography1 geography_Germany geography_Spain gender_Male
0 619.00 42.00 2.00 0.00 1.00 1.00 101348.88 1.00 0 0 0 0
1 608.00 41.00 1.00 83807.86 1.00 1.00 112542.58 0.00 2 0 1 0
2 502.00 42.00 8.00 159660.80 3.00 0.00 113931.57 1.00 0 0 0 0
3 699.00 39.00 1.00 0.00 2.00 0.00 93826.63 0.00 0 0 0 0
5 645.00 44.00 8.00 113755.78 2.00 0.00 149756.71 1.00 2 0 1 1
In [412]:
churn.shape
Out[412]:
(9973, 12)
In [413]:
churn['exited'].value_counts()
Out[413]:
0.00    7943
1.00    2030
Name: exited, dtype: int64
In [414]:
sns.countplot(churn['exited'])
Out[414]:
<AxesSubplot:xlabel='exited', ylabel='count'>
In [415]:
X= churn.drop('exited',axis=1)
In [416]:
X
Out[416]:
creditscore age tenure balance numofproducts isactivemember estimatedsalary geography1 geography_Germany geography_Spain gender_Male
0 619.00 42.00 2.00 0.00 1.00 1.00 101348.88 0 0 0 0
1 608.00 41.00 1.00 83807.86 1.00 1.00 112542.58 2 0 1 0
2 502.00 42.00 8.00 159660.80 3.00 0.00 113931.57 0 0 0 0
3 699.00 39.00 1.00 0.00 2.00 0.00 93826.63 0 0 0 0
5 645.00 44.00 8.00 113755.78 2.00 0.00 149756.71 2 0 1 1
... ... ... ... ... ... ... ... ... ... ... ...
9997 741.00 35.00 6.00 74371.49 1.00 0.00 99595.67 2 0 1 1
10006 659.00 36.00 6.00 123841.49 2.00 0.00 96833.00 0 0 0 1
10025 771.00 39.00 5.00 0.00 2.00 0.00 96270.64 0 0 0 1
10030 772.00 42.00 3.00 75075.31 2.00 0.00 92888.52 1 1 0 1
10031 792.00 28.00 4.00 130142.79 1.00 0.00 38190.78 0 0 0 0

9973 rows × 11 columns

In [417]:
y=churn['exited']
In [418]:
y
Out[418]:
0       1.00
1       0.00
2       1.00
3       0.00
5       1.00
        ... 
9997    0.00
10006   0.00
10025   0.00
10030   1.00
10031   0.00
Name: exited, Length: 9973, dtype: float64
In [419]:
from sklearn.model_selection import train_test_split
In [420]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.20,random_state=42,stratify=y)
In [421]:
X_train
Out[421]:
creditscore age tenure balance numofproducts isactivemember estimatedsalary geography1 geography_Germany geography_Spain gender_Male
5800 624.00 71.00 7.00 0.00 2.00 1.00 108841.83 0 0 0 1
5401 727.00 28.00 1.00 0.00 1.00 0.00 40357.39 2 0 1 1
2369 709.00 45.00 4.00 122917.71 1.00 1.00 11.58 1 1 0 1
7532 623.00 35.00 0.00 130557.24 1.00 1.00 47880.71 0 0 0 0
3495 587.00 35.00 3.00 83286.56 1.00 0.00 125553.52 2 0 1 0
... ... ... ... ... ... ... ... ... ... ... ...
9692 735.00 39.00 1.00 60374.98 1.00 0.00 40223.74 2 0 1 1
4820 517.00 39.00 3.00 0.00 2.00 1.00 12465.51 2 0 1 1
3082 556.00 35.00 10.00 0.00 2.00 1.00 192751.18 0 0 0 1
2558 537.00 53.00 3.00 0.00 1.00 1.00 91406.62 0 0 0 0
9301 633.00 61.00 3.00 157201.48 1.00 1.00 50368.63 0 0 0 1

7978 rows × 11 columns

In [422]:
y_train
Out[422]:
5800   0.00
5401   0.00
2369   1.00
7532   0.00
3495   0.00
       ... 
9692   0.00
4820   0.00
3082   0.00
2558   0.00
9301   0.00
Name: exited, Length: 7978, dtype: float64
In [423]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
In [424]:
minmax = MinMaxScaler()
In [425]:
X_train=minmax.fit_transform(X_train)
In [426]:
X_test=minmax.fit_transform(X_test)
In [427]:
X_train
Out[427]:
array([[0.548     , 0.71621622, 0.7       , ..., 0.        , 0.        ,
        1.        ],
       [0.754     , 0.13513514, 0.1       , ..., 0.        , 1.        ,
        1.        ],
       [0.718     , 0.36486486, 0.4       , ..., 1.        , 0.        ,
        1.        ],
       ...,
       [0.412     , 0.22972973, 1.        , ..., 0.        , 0.        ,
        1.        ],
       [0.374     , 0.47297297, 0.3       , ..., 0.        , 0.        ,
        0.        ],
       [0.566     , 0.58108108, 0.3       , ..., 0.        , 0.        ,
        1.        ]])
In [428]:
from imblearn.over_sampling  import SMOTE
In [429]:
SMOTE().fit_resample(X,y)
Out[429]:
(       creditscore   age  tenure   balance  numofproducts  isactivemember  \
 0           619.00 42.00    2.00      0.00           1.00            1.00   
 1           608.00 41.00    1.00  83807.86           1.00            1.00   
 2           502.00 42.00    8.00 159660.80           3.00            0.00   
 3           699.00 39.00    1.00      0.00           2.00            0.00   
 4           645.00 44.00    8.00 113755.78           2.00            0.00   
 ...            ...   ...     ...       ...            ...             ...   
 15881       757.07 60.50    7.98 125438.21           1.05            0.00   
 15882       579.83 49.40    2.35 100032.41           1.00            0.41   
 15883       601.86 43.62    2.57      0.00           1.57            0.57   
 15884       571.98 39.63    4.00      0.00           1.00            0.00   
 15885       607.70 45.60    4.02      0.00           1.49            0.00   
 
        estimatedsalary  geography1  geography_Germany  geography_Spain  \
 0            101348.88           0                  0                0   
 1            112542.58           2                  0                1   
 2            113931.57           0                  0                0   
 3             93826.63           0                  0                0   
 4            149756.71           2                  0                1   
 ...                ...         ...                ...              ...   
 15881        182223.25           0                  0                0   
 15882          5928.03           1                  0                0   
 15883        174137.68           0                  0                0   
 15884         95434.59           0                  0                0   
 15885         92236.98           1                  0                0   
 
        gender_Male  
 0                0  
 1                0  
 2                0  
 3                0  
 4                1  
 ...            ...  
 15881            0  
 15882            1  
 15883            0  
 15884            0  
 15885            0  
 
 [15886 rows x 11 columns],
 0       1.00
 1       0.00
 2       1.00
 3       0.00
 4       1.00
         ... 
 15881   1.00
 15882   1.00
 15883   1.00
 15884   1.00
 15885   1.00
 Name: exited, Length: 15886, dtype: float64)
In [430]:
X_res,y_res=SMOTE().fit_resample(X,y)
In [431]:
y_res.value_counts()
Out[431]:
1.00    7943
0.00    7943
Name: exited, dtype: int64
In [432]:
X_train,X_test,y_train,y_test=train_test_split(X_res,y_res,test_size=0.20,random_state=42,)
In [433]:
from sklearn.linear_model import LogisticRegression
In [434]:
log=LogisticRegression()
In [435]:
log.fit(X_train,y_train)
Out[435]:
LogisticRegression()
In [436]:
y_pred1=log.predict(X_test)
In [437]:
from sklearn.metrics import accuracy_score
In [438]:
accuracy_score(y_test,y_pred1)
Out[438]:
0.6784140969162996
In [439]:
from sklearn.metrics import precision_score,recall_score,f1_score
In [440]:
precision_score(y_test,y_pred1)
Out[440]:
0.6680722891566265
In [441]:
recall_score(y_test,y_pred1)
Out[441]:
0.7018987341772152
In [442]:
f1_score(y_test,y_pred1)
Out[442]:
0.6845679012345679

SVC¶

In [443]:
from sklearn import svm
In [444]:
svm=svm.SVC()
In [445]:
svm.fit(X_train,y_train)
Out[445]:
SVC()
In [446]:
y_pred2=svm.predict(X_test)
In [447]:
accuracy_score(y_test,y_pred2)
Out[447]:
0.5670232850849591
In [448]:
precision_score(y_test,y_pred2)
Out[448]:
0.5473098330241187
In [449]:
f1_score(y_test,y_pred2)
Out[449]:
0.6316916488222698

KNeighbors Classifier¶

In [450]:
from sklearn.neighbors import KNeighborsClassifier
In [451]:
knn=KNeighborsClassifier()
In [452]:
knn.fit(X_train,y_train)
Out[452]:
KNeighborsClassifier()
In [453]:
y_pred3=knn.predict(X_test)
In [454]:
accuracy_score(y_test,y_pred3)
Out[454]:
0.6768407803650094
In [455]:
precision_score(y_test,y_pred3)
Out[455]:
0.6469962785752259
In [456]:
recall_score(y_test,y_pred3)
Out[456]:
0.770253164556962
In [457]:
f1_score(y_test,y_pred3)
Out[457]:
0.7032649523259173

Decision Tree Classifier¶

In [458]:
from sklearn.tree import DecisionTreeClassifier
In [459]:
dt=DecisionTreeClassifier()
In [460]:
dt.fit(X_train,y_train)
Out[460]:
DecisionTreeClassifier()
In [461]:
y_pred4=dt.predict(X_test)
In [462]:
accuracy_score(y_test,y_pred4)
Out[462]:
0.8442416614222782
In [463]:
precision_score(y_test,y_pred4)
Out[463]:
0.84098051539912
In [464]:
recall_score(y_test,y_pred4)
Out[464]:
0.8468354430379746
In [465]:
f1_score(y_test,y_pred4)
Out[465]:
0.8438978240302742

Random Forest Classifier¶

In [466]:
from sklearn.ensemble import RandomForestClassifier
In [467]:
rf=RandomForestClassifier()
In [468]:
rf.fit(X_train,y_train)
Out[468]:
RandomForestClassifier()
In [469]:
y_pred5=rf.predict(X_test)
In [470]:
accuracy_score(y_test,y_pred5)
Out[470]:
0.89490245437382
In [471]:
precision_score(y_test,y_pred5)
Out[471]:
0.9192462987886945
In [472]:
recall_score(y_test,y_pred5)
Out[472]:
0.8645569620253165
In [473]:
f1_score(y_test,y_pred5)
Out[473]:
0.8910632746249184

Gradient Boosting Classifier¶

In [474]:
from sklearn.ensemble import GradientBoostingClassifier
In [475]:
gb=GradientBoostingClassifier()
In [476]:
gb.fit(X_train,y_train)
Out[476]:
GradientBoostingClassifier()
In [477]:
y_pred6=gb.predict(X_test)
In [478]:
accuracy_score(y_test,y_pred6)
Out[478]:
0.8926998112020138
In [479]:
precision_score(y_test,y_pred6)
Out[479]:
0.9166106254203094
In [480]:
recall_score(y_test,y_pred6)
Out[480]:
0.8626582278481013
In [481]:
f1_score(y_test,y_pred6)
Out[481]:
0.8888164329964134

SAVE THE MODEL¶

In [482]:
final_data1=pd.DataFrame({'Models':['log','svc','knn','dt','rf','gb'],
             'ACC':[accuracy_score(y_test,y_pred1)*100,
                    accuracy_score(y_test,y_pred2)*100,                           
                    accuracy_score(y_test,y_pred3)*100,                                
                    accuracy_score(y_test,y_pred4)*100,
                    accuracy_score(y_test,y_pred5)*100,
                    accuracy_score(y_test,y_pred6)*100]}) 
In [483]:
final_data1
Out[483]:
Models ACC
0 log 67.84
1 svc 56.70
2 knn 67.68
3 dt 84.42
4 rf 89.49
5 gb 89.27
In [484]:
sns.barplot(final_data1['Models'],final_data1['ACC'])
Out[484]:
<AxesSubplot:xlabel='Models', ylabel='ACC'>
In [485]:
final_data2=pd.DataFrame({'Models':['log','svc','knn','dt','rf','gb'],
             'PRE':[precision_score(y_test,y_pred1)*100,
                    precision_score(y_test,y_pred2)*100,                           
                    precision_score(y_test,y_pred3)*100,                                
                    precision_score(y_test,y_pred4)*100,
                    precision_score(y_test,y_pred5)*100,
                    precision_score(y_test,y_pred6)*100]}) 
In [486]:
final_data2
Out[486]:
Models PRE
0 log 66.81
1 svc 54.73
2 knn 64.70
3 dt 84.10
4 rf 91.92
5 gb 91.66
In [487]:
sns.barplot(final_data2['Models'],final_data2['PRE'])
Out[487]:
<AxesSubplot:xlabel='Models', ylabel='PRE'>
In [488]:
final_data3=pd.DataFrame({'Models':['log','svc','knn','dt','rf','gb'],
             'REC':[recall_score(y_test,y_pred1)*100,
                    recall_score(y_test,y_pred2)*100,                           
                    recall_score(y_test,y_pred3)*100,                                
                    recall_score(y_test,y_pred4)*100,
                    recall_score(y_test,y_pred5)*100,
                    recall_score(y_test,y_pred6)*100]}) 
In [489]:
final_data3
Out[489]:
Models REC
0 log 70.19
1 svc 74.68
2 knn 77.03
3 dt 84.68
4 rf 86.46
5 gb 86.27
In [490]:
sns.barplot(final_data3['Models'],final_data3['REC'])
Out[490]:
<AxesSubplot:xlabel='Models', ylabel='REC'>
In [491]:
final_data4=pd.DataFrame({'Models':['log','svc','knn','dt','rf','gb'],
             'F1':[f1_score(y_test,y_pred1)*100,
                    f1_score(y_test,y_pred2)*100,                           
                    f1_score(y_test,y_pred3)*100,                                
                    f1_score(y_test,y_pred4)*100,
                    f1_score(y_test,y_pred5)*100,
                    f1_score(y_test,y_pred6)*100]}) 
In [492]:
final_data4
Out[492]:
Models F1
0 log 68.46
1 svc 63.17
2 knn 70.33
3 dt 84.39
4 rf 89.11
5 gb 88.88
In [493]:
sns.barplot(final_data4['Models'],final_data4['F1'])
Out[493]:
<AxesSubplot:xlabel='Models', ylabel='F1'>
In [494]:
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd

DATA SCIENCE DEVELOPER¶

SAVING AND DEVELOPING THE MODEL AND DEPLOYMENT¶

In [495]:
X_res=minmax.fit_transform(X_res)
In [496]:
rf.fit(X_res,y_res)
Out[496]:
RandomForestClassifier()
In [497]:
import joblib
In [498]:
joblib.dump(rf,'churn_predict_model')
Out[498]:
['churn_predict_model']
In [499]:
model=joblib.load('churn_predict_model')
In [500]:
churn.columns
Out[500]:
Index(['creditscore', 'age', 'tenure', 'balance', 'numofproducts',
       'isactivemember', 'estimatedsalary', 'exited', 'geography1',
       'geography_Germany', 'geography_Spain', 'gender_Male'],
      dtype='object')
In [501]:
churn.head()
Out[501]:
creditscore age tenure balance numofproducts isactivemember estimatedsalary exited geography1 geography_Germany geography_Spain gender_Male
0 619.00 42.00 2.00 0.00 1.00 1.00 101348.88 1.00 0 0 0 0
1 608.00 41.00 1.00 83807.86 1.00 1.00 112542.58 0.00 2 0 1 0
2 502.00 42.00 8.00 159660.80 3.00 0.00 113931.57 1.00 0 0 0 0
3 699.00 39.00 1.00 0.00 2.00 0.00 93826.63 0.00 0 0 0 0
5 645.00 44.00 8.00 113755.78 2.00 0.00 149756.71 1.00 2 0 1 1
In [502]:
model.predict([[619,0,42,2,0.00,1.00,1.00,101348.88,1.00,0,0]])
Out[502]:
array([0.])

GUI¶

In [503]:
from tkinter import *
In [504]:
import joblib
In [505]:
from tkinter import messagebox
In [506]:
master=Tk()





master.mainloop()
In [507]:
#ADD TITLE TO IT
master=Tk()

master.title('churn_predict_model')



master.mainloop()
In [508]:
#LETS SET DIMENSION OF THIS WINDOW
master=Tk()

master.title('churn_predict_model')

master.geometry('400x300')
master.minsize(200,200)
master.maxsize(600,600)
master.mainloop()
In [509]:
#LETS ADD COLOR TO THE WINDOW

master=Tk()

master.title('churn_predict_model')

master.geometry('400x300')
master.minsize(200,200)
master.maxsize(600,600)
master.configure(bg='lightblue')







master.mainloop()
In [510]:
master=Tk()
master.title('churn_predict_model')

master.geometry('400x300')
master.minsize(200,200)
master.maxsize(600,600)
master.configure(bg='lightblue')


label=Label(master,text='Bank churn',font=('Arial',20,'bold'),width=10,height=1,bg='lightblue',foreground='yellow').grid(row=0,columnspan=2)

Label(master,text='creditscore').grid(row=1)
Label(master,text='gender[0-1]').grid(row=2)
Label(master,text='age').grid(row=3)
Label(master,text='tenure').grid(row=4)
Label(master,text='balance').grid(row=5)
Label(master,text='	numofproducts').grid(row=6)
Label(master,text='isactivemember').grid(row=7)
Label(master,text='estimatedsalary').grid(row=8)
Label(master,text='geography1').grid(row=9)
Label(master,text='geography_Germany').grid(row=10)
Label(master,text='geography_Spain').grid(row=11)
master.mainloop()
In [511]:
master=Tk()
master.title('churn_predict_model')

master.geometry('400x300')
master.minsize(200,200)
master.maxsize(600,600)
master.configure(bg='lightblue')


label=Label(master,text='Bank churn',font=('Arial',20,'bold'),width=10,height=1,bg='lightblue',foreground='yellow').grid(row=0,columnspan=2)

Label(master,text='creditscore').grid(row=1)
Label(master,text='gender[0-1]').grid(row=2)
Label(master,text='age').grid(row=3)
Label(master,text='tenure').grid(row=4)
Label(master,text='balance').grid(row=5)
Label(master,text='	numofproducts').grid(row=6)
Label(master,text='isactivemember').grid(row=7)
Label(master,text='estimatedsalary').grid(row=8)
Label(master,text='geography1').grid(row=9)
Label(master,text='geography_Germany').grid(row=10)
Label(master,text='geography_Spain').grid(row=11)


e1=Entry(master,font=('Arial',14),bg='grey',fg='white',borderwidth=3,show='*')
e2=Entry(master,font=('Arial',14),bg='grey',fg='white',borderwidth=3,show='*')
e3=Entry(master,font=('Arial',14),bg='grey',fg='white',borderwidth=3,show='*')
e4=Entry(master,font=('Arial',14),bg='grey',fg='white',borderwidth=3,show='*')
e5=Entry(master,font=('Arial',14),bg='grey',fg='white',borderwidth=3,show='*')
e6=Entry(master,font=('Arial',14),bg='grey',fg='white',borderwidth=3,show='*')
e7=Entry(master,font=('Arial',14),bg='grey',fg='white',borderwidth=3,show='*')
e8=Entry(master,font=('Arial',14),bg='grey',fg='white',borderwidth=3,show='*')
e9=Entry(master,font=('Arial',14),bg='grey',fg='white',borderwidth=3,show='*')
e10=Entry(master,font=('Arial',14),bg='grey',fg='white',borderwidth=3,show='*')
e11=Entry(master,font=('Arial',14),bg='grey',fg='white',borderwidth=3,show='*')


master.mainloop()
In [512]:
master=Tk()
master.title('churn_predict_model')


master.geometry('400x300')
master.minsize(200,200)
master.maxsize(600,600)
master.configure(bg='lightblue')


label=Label(master,text='Bank churn',font=('Arial',20,'bold'),width=10,height=1,bg='lightblue',foreground='yellow').grid(row=0,columnspan=2)

Label(master,text='creditscore').grid(row=1)
Label(master,text='gender[0-1]').grid(row=2)
Label(master,text='age').grid(row=3)
Label(master,text='tenure').grid(row=4)
Label(master,text='balance').grid(row=5)
Label(master,text='	numofproducts').grid(row=6)
Label(master,text='isactivemember').grid(row=7)
Label(master,text='estimatedsalary').grid(row=8)
Label(master,text='geography1').grid(row=9)
Label(master,text='geography_Germany').grid(row=10)
Label(master,text='geography_Spain').grid(row=11)


e1=Entry(master,font=('Arial',14),bg='grey',fg='white',borderwidth=3)
e2=Entry(master,font=('Arial',14),bg='grey',fg='white',borderwidth=3)
e3=Entry(master,font=('Arial',14),bg='grey',fg='white',borderwidth=3)
e4=Entry(master,font=('Arial',14),bg='grey',fg='white',borderwidth=3)
e5=Entry(master,font=('Arial',14),bg='grey',fg='white',borderwidth=3)
e6=Entry(master,font=('Arial',14),bg='grey',fg='white',borderwidth=3)
e7=Entry( master,font=('Arial',14),bg='grey',fg='white',borderwidth=3)
e8=Entry(master,font=('Arial',14),bg='grey',fg='white',borderwidth=3)
e9=Entry(master,font=('Arial',14),bg='grey',fg='white',borderwidth=3)
e10=Entry(master,font=('Arial',14),bg='grey',fg='white',borderwidth=3)
e11=Entry(master,font=('Arial',14),bg='grey',fg='white',borderwidth=3)
  

e1.grid(row=1,column=1)    
e2.grid(row=2,column=1)
e3.grid(row=3,column=1)
e4.grid(row=4,column=1)
e5.grid(row=5,column=1)
e6.grid(row=6,column=1)
e7.grid(row=7,column=1)
e8.grid(row=8,column=1)
e9.grid(row=9,column=1)
e10.grid(row=10,column=1)
e11.grid(row=11,column=1)

master.mainloop()
In [513]:
def show_entry():
    p1=float(e1.get())
    p2=float(e2.get())
    p3=float(e3.get())
    p4=float(e4.get())
    p5=float(e5.get())
    p6=float(e6.get())
    p7=float(e6.get())
    p8=float(e6.get())
    p9=float(e6.get())
    p10=float(e6.get())
    p11=float(e6.get())

    model=joblib.load('churn_predict_model')
    result=model.predict([[p1,p2,p3,p4,p5,p6,p7,p8,p9,p10,p11]])
    
    
    Label(master, text='Bank_churn').grid(row=12)
    Label(master,text=result).grid(row=13)
    
    
master=Tk()

master.title('churn_predict_model')


master.geometry('400x300')
master.minsize(200,200)
master.maxsize(600,600)
master.configure(bg='lightblue')


label=Label(master,text='Bank_churn',font=('Arial',20,'bold'),width=10,height=1,bg='lightblue',foreground='yellow').grid(row=0,columnspan=2)

Label(master,text='creditscore').grid(row=1)
Label(master,text='gender[0-1]').grid(row=2)
Label(master,text='age').grid(row=3)
Label(master,text='tenure').grid(row=4)
Label(master,text='balance').grid(row=5)
Label(master,text='	numofproducts').grid(row=6)
Label(master,text='isactivemember').grid(row=7)
Label(master,text='estimatedsalary').grid(row=8)
Label(master,text='geography1').grid(row=9)
Label(master,text='geography_Germany').grid(row=10)
Label(master,text='geography_Spain').grid(row=11)


e1=Entry(master,font=('Arial',14),bg='grey',fg='white',borderwidth=3)
e2=Entry(master,font=('Arial',14),bg='grey',fg='white',borderwidth=3)
e3=Entry(master,font=('Arial',14),bg='grey',fg='white',borderwidth=3)
e4=Entry(master,font=('Arial',14),bg='grey',fg='white',borderwidth=3)
e5=Entry(master,font=('Arial',14),bg='grey',fg='white',borderwidth=3)
e6=Entry(master,font=('Arial',14),bg='grey',fg='white',borderwidth=3)
e7=Entry(master,font=('Arial',14),bg='grey',fg='white',borderwidth=3)  
e8=Entry(master,font=('Arial',14),bg='grey',fg='white',borderwidth=3)
e9=Entry(master,font=('Arial',14),bg='grey',fg='white',borderwidth=3)
e10=Entry(master,font=('Arial',14),bg='grey',fg='white',borderwidth=3)
e11=Entry(master,font=('Arial',14),bg='grey',fg='white',borderwidth=3)
  

e1.grid(row=1,column=1)    
e2.grid(row=2,column=1)
e3.grid(row=3,column=1)
e4.grid(row=4,column=1)
e5.grid(row=5,column=1)
e6.grid(row=6,column=1)
e7.grid(row=7,column=1)
e8.grid(row=8,column=1)
e9.grid(row=9,column=1)
e10.grid(row=10,column=1)
e11.grid(row=11,column=1)





Button(master,text='predict',command=show_entry,bg='pink',font=('Arial',20,'bold'),borderwidth=3,activebackground='blue').grid()







master.mainloop()
In [ ]: